Abstract
[Objective] This study extracts keywords by combining the internal structure of a single document with the word-vector relationships learned from the whole document collection. [Methods] We used Word2vec to represent every word in the collection as a vector and computed the similarity between words from these vectors. We then modified the TextRank algorithm so that the weight of each candidate keyword is distributed non-uniformly according to word similarity and adjacency, and built the corresponding probability transition matrix for the iterative computation on the word graph and for keyword extraction. [Results] Word2vec and TextRank were integrated effectively, and keyword extraction improved noticeably when the vocabulary of the training collection was reasonably distributed. [Limitations] The method requires costly training on the document collection to obtain the word vectors and the word-relation matrix. [Conclusions] Word relationships in the document collection help correct the word relationships within a single document and improve the accuracy of single-document keyword extraction.
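The weighting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below makes its own assumptions: the transition probability from a word to each adjacent word is proportional to their cosine similarity rescaled to [0, 1], adjacency means co-occurrence inside a sliding window, and every token has a vector. The function names (cosine, build_transition, rank_keywords) and the toy two-dimensional vectors are hypothetical; in practice the vectors would come from a Word2vec model trained on the document collection.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_transition(tokens, vectors, window=2):
    """Return {word: {neighbour: probability}} with each row summing to 1.

    Assumption: two words are adjacent if they co-occur within `window`
    consecutive tokens; the abstract does not specify the window.
    """
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    transition = {}
    for word, adjacent in neighbours.items():
        # Assumption: rescale cosine from [-1, 1] to [0, 1] so weights are non-negative.
        raw = {n: (1.0 + cosine(vectors[word], vectors[n])) / 2.0
               for n in sorted(adjacent)}
        total = sum(raw.values())
        if total == 0.0:
            # Every neighbour is maximally dissimilar: fall back to uniform weights.
            transition[word] = {n: 1.0 / len(raw) for n in raw}
        else:
            transition[word] = {n: w / total for n, w in raw.items()}
    return transition


def rank_keywords(tokens, vectors, window=2, damping=0.85, max_iter=100, tol=1e-8):
    """Iterate TextRank-style scores using the similarity-weighted transitions."""
    transition = build_transition(tokens, vectors, window)
    words = sorted(transition)
    if not words:
        return []
    n = len(words)
    score = {w: 1.0 / n for w in words}
    for _ in range(max_iter):
        new_score = {w: (1.0 - damping) / n for w in words}
        for source in words:
            for target, prob in transition[source].items():
                # A word passes more of its score to more similar neighbours.
                new_score[target] += damping * prob * score[source]
        delta = sum(abs(new_score[w] - score[w]) for w in words)
        score = new_score
        if delta < tol:
            break
    return sorted(score.items(), key=lambda item: item[1], reverse=True)


# Toy data standing in for a tokenised document and trained Word2vec vectors.
toy_tokens = ["keyword", "extraction", "graph", "keyword", "ranking"]
toy_vectors = {
    "keyword": [1.0, 0.0],
    "extraction": [0.9, 0.1],
    "graph": [0.0, 1.0],
    "ranking": [0.8, 0.2],
}
toy_ranking = rank_keywords(toy_tokens, toy_vectors)
```

In this sketch the only departure from plain TextRank is the row normalisation in build_transition: standard TextRank would split a word's score evenly (or by co-occurrence count) among its neighbours, whereas here the split follows corpus-level similarity, so the collection can correct the relationships seen in a single document.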
Source
New Technology of Library and Information Service (《现代图书情报技术》)
CSSCI
2016, No. 6, pp. 20-27 (8 pages)