Abstract
Automatic classification of document data will play an increasingly important role in digital libraries (DL). Existing work generally applies kernel methods based on the support vector machine (SVM) to standard test collections, which has several drawbacks: the document vectors are very large, the kernel function is non-orthogonal and polysemous, and computing the re-occurrence rate is time-consuming. Moreover, because these methods are not tested on real digital library data, their results are less convincing in practice. To address these problems, term expansion is used to preprocess the document vectors, yielding new vectors that are fewer but better, orthogonal, and unambiguous. The document vectors are then ordered semantically to speed up access and computation, and a wavelet kernel maps the documents into L2 space for classification. The method is validated on real classification records from China National Knowledge Infrastructure (CNKI), using both abstracts and full texts. The experimental results show that it outperforms the kernel method and has some value for both theoretical research and practical application.
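The abstract does not say which wavelet the kernel is built from, so the following is only a minimal sketch of the general idea. It assumes a translation-invariant wavelet kernel with the Morlet-type mother wavelet h(u) = cos(1.75u)·exp(-u²/2), a common choice in the wavelet SVM literature and not necessarily the one used in the paper. The names `wavelet_kernel` and `gram_matrix`, the dilation parameter `a`, and the toy document vectors are all hypothetical.

```python
import math


def wavelet_kernel(x, y, a=1.0):
    """Translation-invariant wavelet kernel (assumed Morlet-type wavelet).

    K(x, y) = prod_i h((x_i - y_i) / a), with h(u) = cos(1.75 u) * exp(-u^2 / 2).
    `a` is the dilation (scale) parameter and must be positive.
    """
    if len(x) != len(y):
        raise ValueError("document vectors must have the same length")
    if a <= 0:
        raise ValueError("dilation parameter a must be positive")
    value = 1.0
    for xi, yi in zip(x, y):
        u = (xi - yi) / a
        value *= math.cos(1.75 * u) * math.exp(-0.5 * u * u)
    return value


def gram_matrix(docs, a=1.0):
    """Pairwise kernel values for a list of equal-length document vectors."""
    return [[wavelet_kernel(p, q, a) for q in docs] for p in docs]


# Toy term-weight vectors standing in for preprocessed document vectors.
docs = [
    [0.2, 0.0, 0.7],
    [0.1, 0.1, 0.6],
    [0.9, 0.4, 0.0],
]
K = gram_matrix(docs, a=1.0)
```

A Gram matrix like `K` could then be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. The term expansion and semantic ordering steps described in the abstract are not modeled here.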
Source
《情报学报》
CSSCI
Peking University Core Journal
2013, No. 9, pp. 1000-1008 (9 pages)
Journal of the China Society for Scientific and Technical Information
Keywords
electronic document classification, machine learning, support vector machine, L2 space, wavelet analysis