
Influence of Part-of-Speech on Chinese and English Document Clustering (cited by 11)
Abstract: Different parts of speech contribute differently to document clustering. Using four representative Chinese and English datasets and three clustering algorithms, this paper examines how the four major parts of speech and their combinations affect Chinese and English document clustering. The experiments show that, in both languages, nouns are the most important part of speech for representing document content, and that verbs, adjectives, and adverbs all help clustering. Clustering with nouns alone gives results close to those obtained when all parts of speech are kept, while greatly reducing document dimensionality; however, nouns alone do not give the best clustering result. Compared with other combinations and with any single part of speech, the combination of nouns, verbs, adjectives, and adverbs often yields better clustering. The same part of speech differs considerably between Chinese and English, both in its share of the text and in its single-part-of-speech clustering results. Compared with English, the differences among part-of-speech features and their combinations are more stable in Chinese document clustering.
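Source: Journal of Chinese Information Processing (《中文信息学报》), CSCD, Peking University Core Journal, 2013, No. 2, pp. 65-73 (9 pages)
Funding: 863 Program project "Development of a Search Engine Focused on Scientific Literature Services" (2011AA01A206); 2011 Nanjing University Graduate Research and Innovation Fund project "Research on Chinese-English Bilingual Document Clustering Techniques and Their Applications" (2011CW12)
Keywords: part-of-speech tagging; document clustering; text feature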
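
The feature-selection idea described in the abstract, keeping only words of chosen parts of speech before building document vectors, can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the toy documents, the hand-assigned tags, the tag names (`NOUN`, `VERB`, `ADJ`, `ADV`, `ADP`, `DET`), and the helper names `filter_by_pos` and `build_vectors` are all assumptions made for the example. A real pipeline would obtain tags from a part-of-speech tagger and pass the vectors to a clustering algorithm.

```python
from collections import Counter

# Hypothetical, hand-tagged toy documents: lists of (token, coarse POS tag).
# The tag names are illustrative only, not the tag set used in the paper.
DOCS = [
    [("cats", "NOUN"), ("quickly", "ADV"), ("chase", "VERB"),
     ("small", "ADJ"), ("mice", "NOUN")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN"),
     ("in", "ADP"), ("the", "DET"), ("garden", "NOUN")],
]


def filter_by_pos(doc, keep):
    """Return the tokens of one document whose tag is in `keep`."""
    return [token for token, tag in doc if tag in keep]


def build_vectors(docs, keep):
    """Build a sorted vocabulary and term-count vectors from the kept tokens."""
    filtered = [filter_by_pos(doc, keep) for doc in docs]
    vocab = sorted({token for doc in filtered for token in doc})
    vectors = []
    for doc in filtered:
        counts = Counter(doc)
        vectors.append([counts[token] for token in vocab])
    return vocab, vectors


ALL_TAGS = {tag for doc in DOCS for _, tag in doc}

noun_vocab, noun_vecs = build_vectors(DOCS, {"NOUN"})
four_vocab, four_vecs = build_vectors(DOCS, {"NOUN", "VERB", "ADJ", "ADV"})
all_vocab, all_vecs = build_vectors(DOCS, ALL_TAGS)

# Compare the dimensionality of each feature space.
print("nouns only:", len(noun_vocab))
print("noun+verb+adj+adv:", len(four_vocab))
print("all parts of speech:", len(all_vocab))
```

On this toy input the noun-only vocabulary is smaller than the four-class vocabulary, which is in turn smaller than the full one. That mirrors the dimensionality reduction the abstract attributes to noun-only features, though the toy data says nothing about clustering quality.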

