Abstract
With the rapid development of information technology and the Internet, the number of technical papers stored in digital form is increasing sharply, and classifying them effectively has become an urgent need. Exploiting the fact that a technical paper is a semi-structured document, this paper proposes classifying papers using only the limited metadata they contain. Experiments show that using only a paper's metadata, such as its title, keywords, and abstract, achieves classification accuracy close to that of traditional methods based on the full text. For categories dominated by formulas and symbols, metadata-based classification achieves better results. Because the metadata is far smaller than the full text, the method also greatly shortens classification time.
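The idea of classifying from metadata alone can be illustrated with a minimal sketch. The abstract does not say which classifier or features the authors used, so the multinomial Naive Bayes model, the field names (`title`, `keywords`, `abstract`), the helper `metadata_text`, and the toy records below are all assumptions made for illustration, not the paper's method or data.

```python
import math
import re
from collections import Counter, defaultdict


def metadata_text(paper):
    """Join only the metadata fields; the full text is never read."""
    return " ".join([paper["title"], paper["keywords"], paper["abstract"]])


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed classifier)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(doc):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training records (hypothetical, not data from the paper).
TRAIN = [
    ({"title": "Support vector machines for text categorization",
      "keywords": "classification, machine learning",
      "abstract": "We train a classifier on labeled documents."}, "cs"),
    ({"title": "Indexing structures for database queries",
      "keywords": "database, index, query",
      "abstract": "We study query processing over large tables."}, "cs"),
    ({"title": "Eigenvalue bounds for symmetric matrices",
      "keywords": "matrix, eigenvalue, inequality",
      "abstract": "We prove a bound on the largest eigenvalue."}, "math"),
    ({"title": "Convergence of series in Banach spaces",
      "keywords": "series, convergence, norm",
      "abstract": "We prove convergence theorems for series."}, "math"),
]

model = NaiveBayes().fit([metadata_text(p) for p, _ in TRAIN],
                         [label for _, label in TRAIN])

query = {"title": "A theorem on matrix norm inequality",
         "keywords": "matrix, norm",
         "abstract": "We prove an inequality for the eigenvalue."}
predicted = model.predict(metadata_text(query))
print(predicted)
```

Because each record contributes only a few dozen tokens instead of a full article, training and prediction touch far less text, which is the source of the speed-up the abstract describes. Any accuracy comparison with a full-text classifier would need to be measured on real data.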
Source
《山东师范大学学报(自然科学版)》
CAS
2008, No. 3, pp. 41-43 (3 pages)
Journal of Shandong Normal University (Natural Science)
Keywords
technical literature
text categorization
metadata
accuracy