Abstract
To address two weaknesses of the RAKE (Rapid Automatic Keyword Extraction) algorithm when extracting keywords from Chinese short texts, namely that it ignores word semantics and that its candidate keywords are too long, an improved method based on RAKE is proposed. In the word feature value calculation stage, a co-occurrence matrix is constructed from term distance, inter-word relation frequency, and co-occurrence frequency, and the feature value of each candidate keyword is computed with a contextual value formula. Candidate keywords are then output in descending order of feature value; if a candidate keyword contains more than n words, a window output algorithm is used to limit its length. Experiments show that the proposed method outperforms RAKE and other algorithms on keyword extraction from Chinese short texts.
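For orientation, the sketch below shows the baseline RAKE word score, deg(w)/freq(w), followed by a simple length limit on ranked candidates. It is a minimal illustration under stated assumptions, not the method of the paper: the abstract does not give the contextual value formula or the exact window output rule, so the scoring here is plain RAKE and `best_window` is a hypothetical stand-in that keeps the highest-scoring run of n consecutive words. The function names and the pre-segmented input format are also assumptions.

```python
from collections import defaultdict


def rake_word_scores(phrases):
    """Baseline RAKE word scores, deg(w) / freq(w).

    `phrases` is a list of candidate phrases, each a list of words.
    Segmentation and stop-word splitting are assumed to be done upstream.
    """
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # deg(w) counts co-occurrences inside the phrase, including w itself.
            degree[word] += len(phrase)
    return {w: degree[w] / freq[w] for w in freq}


def best_window(phrase, scores, n):
    """Hypothetical length limit: keep the n-word window with the highest total score."""
    if len(phrase) <= n:
        return list(phrase)
    windows = [phrase[i:i + n] for i in range(len(phrase) - n + 1)]
    return max(windows, key=lambda w: sum(scores[x] for x in w))


def extract_keywords(phrases, n=2):
    """Rank candidates by summed word score (descending), then cap each at n words."""
    scores = rake_word_scores(phrases)
    ranked = sorted(phrases, key=lambda p: sum(scores[w] for w in p), reverse=True)
    result = []
    for phrase in ranked:
        keyword = best_window(phrase, scores, n)
        if keyword not in result:  # skip duplicates created by truncation
            result.append(keyword)
    return result


phrases = [
    ["keyword", "extraction", "algorithm"],
    ["keyword", "extraction"],
    ["short", "text"],
]
print(extract_keywords(phrases, n=2))
```

With n = 2, the three-word candidate is reduced to its best two-word window, while candidates that already fit are passed through unchanged.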
Authors
CHEN Ke-jia (陈可嘉)
HUANG Si-yi (黄思翌)
School of Economics and Management, Fuzhou University, Fuzhou 350108, China
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journals (北大核心)
2021, No. 6, pp. 1171-1175 (5 pages)
Funding
Supported by the National Natural Science Foundation of China (71701019).