Abstract
Keyphrases are concise summaries of a text that capture its main topics and core ideas, and their automatic extraction is an important task in natural language processing and information retrieval. To address the semantic over-generation of candidate phrases in existing unsupervised methods, this paper proposes an automatic keyphrase extraction algorithm that combines integer linear programming (ILP) with the semantic similarity of phrases. The objective function is maximized while candidate phrases with high semantic similarity are penalized, which yields a diverse set of keyphrases. In the experiments, the TextRank and TF-IDF algorithms each generate candidate phrases on two different corpora, and the proposed optimization algorithm then optimizes the weight scores of those candidates. Finally, the optimized results are compared with those of several existing algorithms. The experimental results show that adding the similarity penalty effectively addresses semantic over-generation and yields more diverse keyphrases, and that the precision (P), recall (R) and F values of the optimized results are all higher than those of the other algorithms.
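The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so everything here is an assumption made for illustration: the objective (sum of candidate weights minus a penalty on pairwise similarity among the selected phrases), the fixed selection size `k`, the `penalty` coefficient, and the word-overlap `jaccard` function standing in for a real semantic similarity measure. Instead of calling an ILP solver, the sketch enumerates all 0-1 selections of size `k`, which gives the exact optimum only for small candidate sets.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity; an assumed stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_keyphrases(scores, k, penalty=1.0, sim=jaccard):
    """Choose k phrases maximizing total weight minus a similarity penalty.

    scores:  dict mapping candidate phrase -> weight (e.g. from TextRank or TF-IDF)
    k:       number of phrases to keep (hypothetical constraint)
    penalty: hypothetical coefficient on pairwise similarity
    Every size-k subset is enumerated, so this is only practical for small inputs.
    """
    phrases = sorted(scores)
    best, best_val = None, float("-inf")
    for subset in combinations(phrases, k):
        gain = sum(scores[p] for p in subset)
        redundancy = sum(sim(a, b) for a, b in combinations(subset, 2))
        val = gain - penalty * redundancy
        if val > best_val:
            best, best_val = subset, val
    return list(best), best_val


# Hypothetical candidate weights, invented for this example.
candidates = {
    "integer linear programming": 0.9,
    "linear programming": 0.8,
    "keyphrase extraction": 0.7,
    "semantic similarity": 0.5,
}
selected, value = select_keyphrases(candidates, k=2)
unpenalized, _ = select_keyphrases(candidates, k=2, penalty=0.0)
```

With these made-up weights, the unpenalized selection keeps the two overlapping phrases "integer linear programming" and "linear programming", while the penalized selection replaces the second with "keyphrase extraction". A real ILP formulation would linearize the pairwise term with auxiliary binary variables and hand the problem to a solver.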
Authors
LI Shan-shan (李珊珊)
CHEN Li (陈黎)
TANG Yu-ting (唐裕婷)
WANG Yi-lin (王艺霖)
YU Zhong-hua (于中华)
College of Computer Science, Sichuan University, Chengdu 610065, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journal (北大核心)
2019, Issue B06, pp. 56-59, 70 (5 pages in total)
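Funding
Supported by the Sichuan Province Science and Technology Support Program (2014GZ0063)
and the Sichuan Province Key Research and Development Program (2018GZ0182)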
Keywords
Automatic keyphrase extraction
Integer linear programming
Semantic over-generation
Diversity