Abstract
Traditional text clustering lacks semantic information, represents texts with high-dimensional, sparse feature vectors, and ignores the particular characteristics of Web text. To address these problems, we propose a clustering method for Chinese Web text. Working in a concept space built on HowNet, the method filters out non-noun terms, analyses the semantics of the important words in the text, and performs feature-set clustering on the label feature set and the body-text feature set. It then uses an improved TF-IDF algorithm to select features from the two sets, and finally represents each text as the union of the selected label features and body-text features. This reduces the feature dimensionality and represents the text efficiently. Experiments verify the effectiveness of the method.
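The final selection step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions: the input is already segmented, noun-filtered, and mapped to concepts; plain TF-IDF stands in for the paper's improved TF-IDF, whose details the abstract does not give; and the feature-set clustering step is omitted. The names `tfidf_scores`, `top_k`, `represent`, `k_label`, and `k_body` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: score} dict per document using plain TF-IDF.

    Assumption: this is the textbook formula, not the paper's improved variant.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return scores


def top_k(score, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def represent(label_docs, body_docs, k_label=2, k_body=3):
    """Represent each text as the union of its selected label and body features."""
    label_scores = tfidf_scores(label_docs)
    body_scores = tfidf_scores(body_docs)
    return [
        top_k(l, k_label) | top_k(b, k_body)
        for l, b in zip(label_scores, body_scores)
    ]


# Toy data (invented for illustration): two pages, each with label terms and body terms.
label_docs = [["football", "league"], ["stock", "market"]]
body_docs = [
    ["football", "match", "goal", "team", "news"],
    ["stock", "price", "bank", "news", "news"],
]
features = represent(label_docs, body_docs)
```

In this toy input, "news" occurs in both bodies, so its inverse document frequency is zero and it is never selected; each text ends up described by a handful of terms rather than its full vocabulary, which is the dimensionality reduction the abstract refers to.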
Source
《计算机应用与软件》
CSCD
Peking University Core Journals (北大核心)
2013, No. 12, pp. 222-225, 287 (5 pages)
Computer Applications and Software
Keywords
Web text clustering
Feature dimension reduction
HowNet
Text similarity