Abstract
Because the text similarity measures used in traditional cluster analysis are not suited to short texts, this paper adopts a similarity measure based on sentence constituents to compute the similarity between micro-blog posts. Each post is first split into sentences, and syntactic parsing is then used to obtain its sentence constituents; the words that make up these constituents are selected as feature words. HowNet is used to compute the semantic similarity between words that fill the same constituent in two posts, and these values are combined in a weighted sum, with weights assigned by constituent type, to give the similarity between the two posts. A text similarity matrix is built from these values and clustered to identify hot topics in micro-blogs. Finally, experiments demonstrate the feasibility of the proposed method.
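The weighted combination described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the constituent labels and their weights in `WEIGHTS`, the best-match averaging in `component_sim`, and `toy_word_sim`, which is an exact-match stand-in for the HowNet-based word similarity. Posts are assumed to be already parsed into constituents.

```python
from typing import Callable, Dict, List

# Hypothetical constituent weights; the abstract does not state the actual values.
WEIGHTS: Dict[str, float] = {"subject": 0.3, "predicate": 0.4, "object": 0.3}

# A parsed post: constituent label -> feature words filling that constituent.
Components = Dict[str, List[str]]
WordSim = Callable[[str, str], float]


def toy_word_sim(a: str, b: str) -> float:
    """Stand-in for a HowNet-based word similarity (assumption): exact match only."""
    return 1.0 if a == b else 0.0


def component_sim(words_a: List[str], words_b: List[str], word_sim: WordSim) -> float:
    """Symmetric best-match average between two word lists (an assumed scheme)."""
    if not words_a or not words_b:
        return 0.0
    forward = sum(max(word_sim(a, b) for b in words_b) for a in words_a) / len(words_a)
    backward = sum(max(word_sim(a, b) for a in words_a) for b in words_b) / len(words_b)
    return (forward + backward) / 2


def text_sim(x: Components, y: Components, word_sim: WordSim = toy_word_sim) -> float:
    """Weighted sum of per-constituent similarities between two posts."""
    return sum(
        weight * component_sim(x.get(label, []), y.get(label, []), word_sim)
        for label, weight in WEIGHTS.items()
    )


def similarity_matrix(posts: List[Components]) -> List[List[float]]:
    """Pairwise similarity matrix that a clustering step could take as input."""
    return [[text_sim(p, q) for q in posts] for p in posts]


posts = [
    {"subject": ["team"], "predicate": ["wins"], "object": ["match"]},
    {"subject": ["team"], "predicate": ["wins"], "object": ["cup"]},
    {"subject": ["price"], "predicate": ["rises"], "object": ["market"]},
]
matrix = similarity_matrix(posts)
```

In this toy input the first two posts share a subject and a predicate, so they score higher with each other than with the third post. A real word similarity function would replace `toy_word_sim`, and the resulting matrix would then be passed to whichever clustering algorithm is chosen.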
Source
《情报科学》
CSSCI
Peking University Core Journal
2015, Issue 11, pp. 44-47, 56 (5 pages)
Information Science
Funding
National Natural Science Foundation of China (Grant No. 71273194)