Cluster Analysis of Retrieved Results for Specific Subjects Based on Text Topic Similarity (基于文本主题相似性的专题文献检索结果的聚类分析)
Cited by: 4
Abstract: A query to a literature database usually returns a large, linearly ranked list of records, and categorizing those records further so that they are easier to use is an important problem in information retrieval. Taking a relatively narrow topic (spinal cord injuries) as the query, this paper explores a method for clustering the retrieved records using the topic-similarity information that the source database already provides, and for assigning a label to each cluster. The procedure has four steps: (1) search Medline through PubMed for papers on spinal cord injuries (the source papers); in practice, the first 50 of the 1906 papers on the topic published in the most recent two years were taken; (2) use the PubMed "Related Articles" function to retrieve the related articles of each source paper (5108 in total) and keep the frequently occurring ones (those appearing at least 5 times, 31 in total); (3) build an incidence matrix of source papers against related articles and cluster the source papers on it with hierarchical clustering; (4) extract the content or class label of each cluster both by manual analysis and with a vector space model algorithm over subject terms, as a preliminary check on the correctness of the classification. The similarity-based cluster analysis divided the source papers into three major categories, and the content extracted by manual analysis was basically consistent with the labels produced by the vector space model method. For the topic studied here, reclassifying retrieval results with the article-relatedness information provided by the literature database is feasible.
Authors: Wang Xiuyan (王秀艳), Cui Lei (崔雷)
Source: 《情报学报》 (Journal of the China Society for Scientific and Technical Information), indexed in CSSCI and the Peking University Core Journals list (北大核心), 2011, No. 5, pp. 456-463 (8 pages)
Keywords: related articles; text categorization; cluster analysis; spinal cord injuries; vector space model; term frequency; document frequency
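
Steps (2) and (3) of the abstract can be illustrated with a small sketch. The abstract does not say which distance measure or linkage rule the authors used, so the Jaccard distance and average linkage below are assumptions, as are the function names and the toy data. The frequency threshold is lowered from 5 to 2 so that the toy example stays small.

```python
from itertools import combinations


def incidence_matrix(related, min_freq=5):
    """Build a binary source-by-related-article matrix.

    related maps each source paper ID to the set of its related article IDs.
    Only related articles linked to at least min_freq source papers are kept.
    """
    counts = {}
    for articles in related.values():
        for article in articles:
            counts[article] = counts.get(article, 0) + 1
    columns = sorted(a for a, c in counts.items() if c >= min_freq)
    rows = sorted(related)
    matrix = [[1 if c in related[s] else 0 for c in columns] for s in rows]
    return rows, columns, matrix


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two binary rows (assumed measure)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 1.0


def average_linkage(rows, matrix, n_clusters):
    """Agglomerative clustering with average linkage (assumed linkage rule)."""
    n = len(rows)
    dist = {(i, j): jaccard_distance(matrix[i], matrix[j])
            for i, j in combinations(range(n), 2)}

    def cluster_dist(a, b):
        total = sum(dist[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters; p < q, so deleting q is safe.
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: cluster_dist(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(rows[i] for i in c) for c in clusters]


# Hypothetical toy data: four source papers and their related articles.
toy_related = {
    "S1": {"R1", "R2", "R6"},
    "S2": {"R1", "R2", "R3"},
    "S3": {"R4", "R5"},
    "S4": {"R3", "R4", "R5"},
}
toy_rows, toy_columns, toy_matrix = incidence_matrix(toy_related, min_freq=2)
toy_clusters = average_linkage(toy_rows, toy_matrix, n_clusters=2)
print(toy_columns)
print(toy_clusters)
```

In the toy data, R6 is linked to only one source paper and is therefore dropped from the matrix, in the same way the study kept 31 of the 5108 related articles. The cluster labeling of step (4) is not covered by this sketch.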