Abstract
Keyword extraction is a fundamental problem in natural language processing, yet existing keyword extraction algorithms lack an effective representation of phrases. To extract keyphrases that better reflect the topic of a text, this paper proposes PhraseVecRank, a keyphrase extraction method based on phrase embeddings. First, a phrase vector construction model based on LSTM (Long Short-Term Memory) and CNN (Convolutional Neural Network) auto-encoders is designed to solve the semantic representation of complex phrases. Then, PhraseVecRank uses the phrase embeddings to compute a topic weight for each candidate phrase, and combines the semantic similarity between candidate phrase embeddings with co-occurrence information to compute edge weights, improving keyphrase extraction through topic-weighted ranking. Experiments on public datasets and academic paper data show that PhraseVecRank effectively extracts keyphrases related to the topic information of the text, and that the phrase vectors built with the auto-encoders better represent the semantic information of phrases.
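The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no formulas, so everything below is an assumption made for illustration. The topic weight is taken as the cosine similarity between a phrase vector and the mean of all phrase vectors, the edge weight as the co-occurrence count multiplied by cosine similarity, and the ranking as a personalized PageRank. The names `cosine`, `topic_weighted_rank`, `example_vectors`, and `example_cooccur` are hypothetical, and the toy vectors stand in for embeddings that the auto-encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_weighted_rank(vectors, cooccur, damping=0.85, iters=100, tol=1e-9):
    """Rank candidate phrases with a topic-weighted (personalized) PageRank.

    vectors: dict phrase -> embedding (list of floats), assumed precomputed.
    cooccur: dict (phrase_a, phrase_b) -> co-occurrence count.
    """
    phrases = list(vectors)
    n = len(phrases)
    dim = len(vectors[phrases[0]])

    # ASSUMPTION: the document topic is approximated by the mean phrase vector.
    doc = [sum(vectors[p][i] for p in phrases) / n for i in range(dim)]

    # ASSUMPTION: topic weight = non-negative cosine to the topic vector,
    # normalized so that the weights sum to 1.
    raw = {p: max(cosine(vectors[p], doc), 0.0) for p in phrases}
    total = sum(raw.values())
    topic = {p: (raw[p] / total if total else 1.0 / n) for p in phrases}

    # ASSUMPTION: edge weight = co-occurrence count * non-negative cosine similarity.
    edge = {p: {} for p in phrases}
    for (a, b), count in cooccur.items():
        w = count * max(cosine(vectors[a], vectors[b]), 0.0)
        if a != b and w > 0:
            edge[a][b] = edge[a].get(b, 0.0) + w
            edge[b][a] = edge[b].get(a, 0.0) + w
    out_sum = {p: sum(edge[p].values()) for p in phrases}

    score = dict(topic)
    for _ in range(iters):
        # Score held by phrases without edges is redistributed by topic weight.
        dangling = sum(score[p] for p in phrases if out_sum[p] == 0)
        new = {}
        for p in phrases:
            incoming = sum(score[q] * w / out_sum[q] for q, w in edge[p].items())
            new[p] = (1 - damping) * topic[p] + damping * (incoming + dangling * topic[p])
        delta = sum(abs(new[p] - score[p]) for p in phrases)
        score = new
        if delta < tol:
            break
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): 2-dimensional stand-ins for phrase embeddings.
example_vectors = {
    "phrase embedding": [1.0, 0.1],
    "keyword extraction": [0.9, 0.2],
    "auto-encoder": [0.8, 0.3],
    "weather report": [0.0, 1.0],
}
example_cooccur = {
    ("phrase embedding", "keyword extraction"): 3,
    ("phrase embedding", "auto-encoder"): 2,
    ("keyword extraction", "weather report"): 1,
}

for phrase, value in topic_weighted_rank(example_vectors, example_cooccur):
    print(f"{phrase}: {value:.3f}")
```

In this sketch the topic weights serve as the restart distribution, so a phrase that is both close to the document topic and strongly connected to similar phrases ends up near the top of the ranking, while an off-topic phrase with weak edges falls to the bottom.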
Authors
SUN Xin (孙新)
GE Chen (盖晨)
SHEN Chang-hong (申长虹)
ZHANG Ying-jie (张颖捷)
Affiliations: Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Beijing Institute of Technology Southeast Academy of Information Technology, Putian, Fujian 351100, China
Source
Acta Electronica Sinica (《电子学报》)
Indexed in: EI, CAS, CSCD, Peking University Core Journals (北大核心)
2021, No. 9, pp. 1682-1690 (9 pages)
Funding
National Key Research and Development Program of China (No. 2017YFB0803300).
Keywords
phrase embedding
auto-encoder
topic weighting
keyphrase extraction