Abstract
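By studying the characteristics of the biological literature on genes, a method is proposed for automatically annotating and classifying gene-related biological documents. Building on the K-nearest-neighbor algorithm, it adopts a Chi-Square feature selection scheme and emphasizes the characteristics of the Chi-Square selection in the weighting algorithm. In addition, a logical document-blocking method is used to introduce vectors formed from the information in an additional biological controlled vocabulary directly into the classification algorithm, in order to improve classification and annotation. Experiments show that the proposed algorithm outperforms the commonly used term frequency/inverse document frequency weighting method; on the Text Retrieval Conference (TREC) data sets, its classification and annotation results are 3.14% and 4.12% higher, respectively, than the best results published by TREC.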
Based on the K-nearest-neighbor algorithm, an improved method is proposed for automatically annotating and classifying gene-related documents from the biology literature. The method employs Chi-Square feature selection and emphasizes the selected features in the term-weighting step. Furthermore, classification and annotation performance is improved by dividing documents into logical blocks and introducing additional vectors built from the biological controlled vocabulary MeSH directly into the classification algorithm. Experimental results show that the proposed method outperforms the commonly used TF-IDF (term frequency-inverse document frequency) weighting method; on the TREC (Text Retrieval Conference) data sets, its results are 3.14% higher in classification and 4.13% higher in annotation than the best results announced by TREC.
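The pipeline named in the abstract (Chi-Square feature selection, a weighting that emphasizes the selected terms, then K-nearest-neighbor voting) can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so multiplying term frequency by the Chi-Square score is an assumption made here for illustration only; all function names and the toy corpus are hypothetical, and the logical-block and MeSH vectors are omitted.

```python
from collections import Counter
import math


def chi_square(n_docs, a, b, c, d):
    """Chi-square statistic of a term/class 2x2 contingency table.

    a: in class, has term     b: not in class, has term
    c: in class, lacks term   d: not in class, lacks term
    """
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0


def term_scores(docs, labels, target):
    """Score every term against class `target`; each doc is a list of tokens."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    df_all = Counter(t for doc in docs for t in set(doc))
    df_pos = Counter(
        t for doc, y in zip(docs, labels) if y == target for t in set(doc)
    )
    scores = {}
    for term, df in df_all.items():
        a = df_pos[term]
        b = df - a
        c = n_pos - a
        d = n - n_pos - b
        scores[term] = chi_square(n, a, b, c, d)
    return scores


def select_terms(scores, top_k):
    """Keep the top_k highest-scoring terms together with their scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])


def vectorize(doc, weights):
    """ASSUMED weighting: term frequency times chi-square score."""
    tf = Counter(t for t in doc if t in weights)
    return {t: f * weights[t] for t, f in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, train_vecs, labels, k=3):
    """Majority vote among the k training vectors most similar to `query`."""
    ranked = sorted(
        zip(train_vecs, labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus (hypothetical data, not taken from the TREC collection).
docs = [
    ["gene", "expression", "mouse"],
    ["gene", "mutation", "mouse"],
    ["protein", "binding", "gene"],
    ["market", "price", "trade"],
    ["stock", "price", "market"],
    ["trade", "policy", "market"],
]
labels = ["bio", "bio", "bio", "other", "other", "other"]

weights = select_terms(term_scores(docs, labels, "bio"), top_k=2)
train_vecs = [vectorize(doc, weights) for doc in docs]
query_vec = vectorize(["gene", "mouse", "knockout"], weights)
print(knn_predict(query_vec, train_vecs, labels, k=3))
```

Because the Chi-Square statistic is symmetric, terms that are strongly absent from a class score as highly as terms that are strongly present; in a two-class setting both kinds are discriminative, which is why the sketch keeps them.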
Source
Journal of Xi'an Jiaotong University (《西安交通大学学报》)
EI
CAS
CSCD
Peking University Core Journal (北大核心)
2008, No. 2, pp. 171-174 (4 pages)
Funding
Natural Science Foundation of Shaanxi Province (2004F06)
"985" Project Phase II Platform Construction Fund
Keywords
gene ontology
classification and annotation
nearest neighbor algorithm