Abstract
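Different part-of-speech features contribute differently to text clustering. Using four representative groups of Chinese and English datasets and three clustering algorithms, this paper examines how four major parts of speech and their combinations affect Chinese and English text clustering. The experiments show that, in both languages, nouns are the most important part of speech for representing document content, and that verbs, adjectives, and adverbs all help clustering. Clustering on nouns alone gives results close to those obtained when all parts of speech are retained, while greatly reducing the dimensionality of the text; however, nouns alone do not achieve the best clustering performance. Compared with other combinations and with single parts of speech, the combination of nouns, verbs, adjectives, and adverbs often achieves better clustering. The same part of speech differs considerably between Chinese and English, both in the proportion of words it accounts for and in its single-part-of-speech clustering results. Compared with English, the differences among part-of-speech features and their combinations are more stable in Chinese text clustering.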
Different parts of speech play different roles in document clustering. Using four representative English and Chinese datasets, the paper applies three clustering algorithms to investigate the influence of four major parts of speech, as well as their combinations, on Chinese and English document clustering. The experimental results reveal that nouns are the most important part of speech for representing the content of a document. In addition, verbs, adjectives, and adverbs all contribute to document clustering. Using only nouns to characterize a document yields results similar to those obtained with all parts of speech and reduces the document dimensionality to a great extent, but it does not produce the best clustering result. The combination of the four parts of speech generally produces the best clustering result. Single parts of speech vary considerably in clustering performance between Chinese and English, and the differences among parts of speech are more consistent in Chinese document clustering.
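The feature-selection idea described in the abstract can be illustrated with a minimal sketch. It assumes documents are already tokenized and tagged; the toy corpus, the tag names (NOUN, VERB, ADJ, ADV), and the helper functions `select_features` and `build_vectors` are hypothetical and do not come from the paper. The abstract does not name its three clustering algorithms, so the sketch stops at building the term-frequency vectors that any clustering algorithm could consume.

```python
from collections import Counter

# Hypothetical pre-tagged corpus: each document is a list of (token, POS) pairs.
# The tag names are an assumption, not the tagset used in the paper.
DOCS = [
    [("market", "NOUN"), ("rises", "VERB"), ("sharply", "ADV"), ("the", "DET")],
    [("strong", "ADJ"), ("market", "NOUN"), ("falls", "VERB"), ("in", "ADP")],
    [("team", "NOUN"), ("wins", "VERB"), ("final", "NOUN"), ("easily", "ADV")],
]


def select_features(doc, keep_tags):
    """Keep only the tokens whose POS tag is in keep_tags."""
    return [tok for tok, tag in doc if tag in keep_tags]


def build_vectors(docs, keep_tags):
    """Return (vocabulary, term-frequency vectors) for the POS-filtered documents."""
    filtered = [select_features(d, keep_tags) for d in docs]
    vocab = sorted({tok for toks in filtered for tok in toks})
    vectors = []
    for toks in filtered:
        counts = Counter(toks)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


# Nouns only versus the four-way combination studied in the paper.
noun_vocab, noun_vecs = build_vectors(DOCS, {"NOUN"})
all4_vocab, all4_vecs = build_vectors(DOCS, {"NOUN", "VERB", "ADJ", "ADV"})

# The vocabulary size is the dimensionality of the document vectors.
print(len(noun_vocab), len(all4_vocab))  # prints: 3 9
```

On this toy corpus the noun-only representation has 3 dimensions against 9 for the four-way combination, which mirrors the dimensionality reduction the abstract reports, though the numbers themselves are purely illustrative.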
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2013, Issue 2, pp. 65-73 (9 pages)
Journal of Chinese Information Processing
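Funding
863 Program project "Development of a Search Engine Mainly Serving Scientific and Technical Literature" (2011AA01A206)
2011 Nanjing University Graduate Research and Innovation Fund project "Research on Chinese-English Bilingual Text Clustering Technology and Its Applications" (2011CW12)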
Keywords
part-of-speech tagging (词性标注)
document clustering (文本聚类)
text feature (文本特征)