Application of topic analysis in document clustering (cited by: 2)
Abstract: High-frequency features reflect a document's topic only incompletely and therefore cannot yield high-quality clustering results. To address this, and to obtain a precise description of the information each cluster conveys, lexical chains are used to represent the topics a document describes, and documents are clustered according to the similarity between their lexical chains. After clustering, lexical chains that belong to the same cluster and describe the same topic are merged. By analyzing how the topic information described by each lexical chain is distributed across the clusters, keyword sets that fully reflect the topic of each cluster are extracted. Experiments show that this method clusters better than clustering on high-frequency features and that, because the distribution of topic information within each cluster is analyzed, the extracted keywords reflect well the information each cluster emphasizes.
Source: Journal of Harbin Institute of Technology (《哈尔滨工业大学学报》), 2009, No. 3, pp. 53-57 (5 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals.
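Funding: Key Program of the National Natural Science Foundation of China (60435020); National High Technology Research and Development Program of China (2006AA01Z197, 2007AA01Z172).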
Keywords: HowNet; combination of lexical chains; hierarchical clustering based on topic
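
The pipeline outlined in the abstract (cluster documents by lexical-chain similarity, then pick per-cluster keywords from how chain words are distributed across clusters) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes each lexical chain is already given as a plain set of words (the paper builds chains with HowNet, which is not reproduced here), and it swaps in Jaccard overlap, average-link agglomeration, and a simple count-times-share keyword score as stand-ins. All function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Overlap between two chains, each treated as a set of words (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def doc_similarity(chains_a, chains_b):
    """Symmetrised average of each chain's best match in the other document."""
    if not chains_a or not chains_b:
        return 0.0

    def one_way(xs, ys):
        return sum(max(jaccard(x, y) for y in ys) for x in xs) / len(xs)

    return 0.5 * (one_way(chains_a, chains_b) + one_way(chains_b, chains_a))


def cluster(docs, threshold):
    """Average-link agglomerative clustering over document indices.

    Merging stops once the most similar pair of clusters falls below `threshold`.
    """
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        total = sum(doc_similarity(docs[i], docs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), best = max(
            (((a, b), link(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


def cluster_keywords(docs, clusters, top_n=3):
    """Rank words by in-cluster count times the share of their occurrences in that cluster."""
    counts = [Counter(w for i in c for chain in docs[i] for w in chain) for c in clusters]
    total = Counter()
    for c in counts:
        total.update(c)
    return [
        sorted(c, key=lambda w: (-c[w] * c[w] / total[w], w))[:top_n]
        for c in counts
    ]


# Toy input: each document is a list of lexical chains (sets of words).
docs = [
    [{"bank", "loan", "credit"}, {"market", "stock"}],
    [{"bank", "loan"}, {"stock", "market", "share"}],
    [{"match", "goal", "team"}],
    [{"team", "goal", "coach"}],
]
groups = cluster(docs, threshold=0.3)
keywords = cluster_keywords(docs, groups)
```

On this toy input the two finance documents and the two sports documents end up in separate clusters. Multiplying a word's in-cluster count by its share of occurrences favours words concentrated in one cluster over words spread evenly across all of them, which is one simple way to approximate the distribution analysis the abstract mentions.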