Abstract
A query to a literature database typically returns a large number of records in a flat, linear list; how to categorize these records further for the user's convenience is an important problem in information retrieval. Taking a relatively narrow topic (spinal cord injuries) as the query, this paper explores a method for clustering the retrieved records using the topic-similarity information that the database itself provides, and for assigning a class label to each cluster. (1) We searched the authoritative biomedical database Medline through PubMed for papers on spinal cord injuries and took the first 50 of the 1,906 papers published in the most recent two years as source papers. (2) Using the PubMed "Related Articles" function, we retrieved the related articles of each source paper (5,108 in total) and kept those that occurred at least 5 times (31 articles). (3) We built an association matrix between the source papers and the retained related articles and clustered the source papers on it with hierarchical clustering algorithms. (4) We extracted the content or class label of each cluster both by manual analysis and by a vector space model algorithm over subject headings, and made a preliminary evaluation of the correctness of the classification. Similarity-based cluster analysis divided the source papers on spinal cord injuries into three major categories, and the content extracted by manual analysis was basically consistent with the labels produced by the vector space model method. For the topic studied in this paper, reclassifying retrieval results with the relatedness information provided by the literature database is feasible.
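The steps in the abstract can be illustrated with a minimal sketch in Python. It is not the authors' implementation: the paper IDs, related-article IDs, and subject headings below are hypothetical toy data, the Jaccard similarity and average linkage are assumed choices (the abstract says only "hierarchical clustering"), and the label score (in-cluster term frequency weighted by inverse document frequency) is one plausible reading of the vector space model step.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy data: source paper -> related-article IDs, and
# source paper -> subject headings. Not taken from the paper.
related = {
    "S1": {"R1", "R2", "R3"},
    "S2": {"R1", "R2", "R4"},
    "S3": {"R5", "R6"},
    "S4": {"R5", "R6", "R7"},
}
headings = {
    "S1": ["Spinal Cord Injuries", "Rehabilitation"],
    "S2": ["Spinal Cord Injuries", "Rehabilitation"],
    "S3": ["Spinal Cord Injuries", "Nerve Regeneration"],
    "S4": ["Spinal Cord Injuries", "Nerve Regeneration"],
}

def frequent_related(related, min_count):
    """Keep related articles that occur for at least min_count source papers."""
    counts = Counter(r for rs in related.values() for r in rs)
    return sorted(r for r, c in counts.items() if c >= min_count)

def incidence_matrix(related, columns):
    """Rows are source papers; columns are retained related articles (0/1)."""
    return {s: [1 if c in rs else 0 for c in columns] for s, rs in related.items()}

def jaccard(u, v):
    """Jaccard similarity of two binary vectors (assumed similarity measure)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

def average_linkage(matrix, n_clusters):
    """Naive agglomerative clustering: merge the most similar pair each round."""
    clusters = [[s] for s in sorted(matrix)]

    def sim(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(jaccard(matrix[x], matrix[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

def label_cluster(cluster, headings, top_n=1):
    """Rank headings by in-cluster frequency times log(N / document frequency)."""
    n_docs = len(headings)
    df = Counter(t for terms in headings.values() for t in set(terms))
    tf = Counter(t for s in cluster for t in headings[s])
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

columns = frequent_related(related, min_count=2)  # the paper used a threshold of 5
matrix = incidence_matrix(related, columns)
clusters = average_linkage(matrix, n_clusters=2)  # the paper reports 3 clusters
for cluster in clusters:
    print(cluster, label_cluster(cluster, headings))
```

A heading shared by every source paper (here "Spinal Cord Injuries") scores zero under this weighting, which is why it does not appear as a cluster label even though it is the most frequent term.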
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
Indexed in: CSSCI; Peking University Core Journals (北大核心)
2011, No. 5, pp. 456-463 (8 pages)
Keywords
text categorization
related articles
cluster analysis
spinal cord injuries
vector space model
term frequency
document frequency