Abstract
Keyword extraction with the traditional universal gravitation model performs poorly because word mass and inter-word distance are measured inadequately. To address this, the model is improved in both its mass representation and its distance calculation. A general-word list is built with a method based on word frequency and document distribution entropy and is used to filter candidate words; TF-IDF is then improved by combining position, part-of-speech, and word-length features to compute each word's external importance. A co-occurrence network graph is constructed, and each word's internal importance is measured by its word correlation degree. Internal and external importance are fused into the word mass, which gives the graph nodes differentiated initial weights. Dependency syntax distance is introduced alongside semantic distance, and the gravitational force between words is used as the edge weight. After multiple iterations, the words are ranked and the TopK keywords are output. Experimental results show that the method outperforms traditional methods on 3GPP technical specifications and on the public SemEval2010 and DUC2001 datasets, which demonstrates its effectiveness and generality.
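The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption: the Newton-style force `m_a * m_b / d**2`, the damping factor, the use of normalized mass as both the initial score and the restart distribution, and the names `gravity_rank` and `top_k`. The `mass` values stand in for the fused internal and external importance, and the `distance` values stand in for the combined semantic and dependency syntax distance.

```python
from collections import defaultdict


def gravity_rank(mass, distance, damping=0.85, iterations=50):
    """Score words on a graph whose edge weights are gravity-like attractions.

    mass:     dict word -> positive mass (hypothetical fused importance)
    distance: dict (word_a, word_b) -> positive distance for co-occurring pairs
    """
    # Edge weight (assumed form): F = m_a * m_b / d^2, stored symmetrically.
    force = defaultdict(dict)
    for (a, b), d in distance.items():
        f = mass[a] * mass[b] / (d * d)
        force[a][b] = f
        force[b][a] = f

    # Nodes start from their normalized mass instead of a uniform score.
    total_mass = sum(mass.values())
    base = {w: m / total_mass for w, m in mass.items()}
    score = dict(base)

    for _ in range(iterations):
        new_score = {}
        for w in mass:
            incoming = 0.0
            for v, f in force[w].items():
                # Neighbor v passes on its score in proportion to the force.
                incoming += score[v] * f / sum(force[v].values())
            new_score[w] = (1 - damping) * base[w] + damping * incoming
        score = new_score
    return score


def top_k(score, k):
    """Return the k highest-scoring words, best first."""
    return sorted(score, key=score.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy values chosen only for illustration; they are not from the paper.
    toy_mass = {"gravity": 3.0, "model": 2.0, "keyword": 2.5, "the": 0.1}
    toy_distance = {
        ("gravity", "model"): 1.0,
        ("gravity", "keyword"): 2.0,
        ("model", "keyword"): 1.5,
        ("the", "gravity"): 1.0,
    }
    print(top_k(gravity_rank(toy_mass, toy_distance), 3))
```

In this sketch a low-mass word such as "the" attracts little score even when it sits next to a heavy word, which is the intuition behind weighting both nodes and edges.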
Authors
LI Huan (李欢)
LYU Xue-qiang (吕学强)
LI Bao-an (李宝安)
XU Li-ping (徐丽萍)
Affiliations: Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; Beijing Research Center of Urban System Engineering, Beijing 100089, China
Source
Computer Engineering and Design (《计算机工程与设计》)
Peking University Core Journal (北大核心)
2019, Issue 4, pp. 1091-1098 (8 pages)
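Funding
National Natural Science Foundation of China (61671070)
Major Program of the National Social Science Fund of China (15ZDB017)
Major Project of the State Language Commission (ZDA125-26)
Beijing Advanced Innovation Center for Imaging Technology (BAICIT-2016003)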