Abstract
[Objective/significance] In text clustering based on the vector space model, text similarity computation ignores the semantic associations between feature terms. To address this problem, this paper proposes an improved semantic text similarity computation method. [Methods/process] The method uses the Wikipedia knowledge base to compute the semantic relatedness between feature terms, combines it with the weights of the feature terms in each text to construct a semantic weighting factor for text similarity, and evaluates the result in K-means text clustering experiments. [Results/conclusion] Compared with traditional cosine similarity, the improved semantic text similarity effectively improves clustering accuracy when applied to text clustering. [Limitations] The semantic relatedness computation does not perform word sense disambiguation.
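The abstract does not give the formula for the semantic weighting factor, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a bilinear ("soft cosine") form in which a term-to-term relatedness matrix weights the products of term weights. The vocabulary, the relatedness values in `REL`, and the function names are all hypothetical; in the paper the relatedness scores would come from Wikipedia.

```python
import math

def cosine(u, v):
    """Traditional cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def bilinear(u, v, rel):
    """Sum of u[i] * rel[i][j] * v[j] over all term pairs (i, j)."""
    return sum(u[i] * rel[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def semantic_similarity(u, v, rel):
    """Assumed soft-cosine form: related terms contribute even when they differ."""
    denom = math.sqrt(bilinear(u, u, rel) * bilinear(v, v, rel))
    return bilinear(u, v, rel) / denom if denom else 0.0

# Hypothetical vocabulary: ["car", "automobile", "banana"].
# REL[i][j] is an assumed relatedness score in [0, 1], not real Wikipedia data.
REL = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"
doc_c = [0.0, 0.0, 1.0]  # mentions only "banana"

plain = cosine(doc_a, doc_b)
semantic = semantic_similarity(doc_a, doc_b, REL)
unrelated = semantic_similarity(doc_a, doc_c, REL)
```

Here the two documents share no terms, so plain cosine similarity treats them as unrelated, while the relatedness-weighted score reflects that "car" and "automobile" are close in meaning. A similarity of this kind could then replace cosine similarity in the assignment step of K-means clustering.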
Source
Information Studies: Theory & Application (《情报理论与实践》)
Indexed in: CSSCI; Peking University Core Journals (北大核心)
2016, No. 2, pp. 129-133 (5 pages)
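Funding
Supported by the National Natural Science Foundation of China project "Research on Chinese Text Semantic Similarity Based on Complex Networks" (基于复杂网络的中文文本语义相似度研究)
Project No.: 71373200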