Electronic Document Classification Based on Wavelet Analysis (基于小波分析的电子文献分类)

Abstract: Automatic document classification will play an increasingly important role in digital libraries (DL). The usual approach applies a kernel method based on the support vector machine (SVM) to standard test collections, and it has several shortcomings: the document vectors are very large, the kernel functions are non-orthogonal and polysemous, computing the recurrence rate (重现率) is time-consuming, and, because no real digital library data are used for testing, the results are not very convincing in practice. To address these problems, term expansion is used to preprocess the document vectors, yielding new vectors that are fewer but better, orthogonal, and unambiguous. The document vectors are then ordered semantically to speed up access and computation, and a wavelet kernel maps the documents onto L2 space for classification. The method is validated on real classification records from the China National Knowledge Infrastructure (CNKI), using both abstracts and full text. The experimental results show that it outperforms the kernel method and has some value for theoretical research and practical application.
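
The abstract does not give the form of the wavelet kernel or the classifier settings, so the following is only a minimal sketch under stated assumptions. It assumes a commonly used Morlet-type translation-invariant wavelet kernel, K(x, y) = prod_i cos(1.75 d_i) exp(-d_i^2 / 2) with d_i = (x_i - y_i) / a, and it swaps the SVM for a simple kernel perceptron so that the example runs with the standard library alone. The toy document vectors, the dilation parameter `a`, and all function names are hypothetical and are not taken from the article.

```python
import math

def wavelet_kernel(x, y, a=1.0):
    """Morlet-type wavelet kernel (assumed form, not confirmed by the article).

    K(x, y) = prod_i cos(1.75 * d_i) * exp(-d_i**2 / 2), d_i = (x_i - y_i) / a
    """
    k = 1.0
    for xi, yi in zip(x, y):
        d = (xi - yi) / a
        k *= math.cos(1.75 * d) * math.exp(-0.5 * d * d)
    return k

def train_kernel_perceptron(X, labels, kernel, epochs=20):
    """Train a kernel perceptron (a stand-in for the SVM); labels are +1 or -1.

    Returns the dual coefficients: alpha[i] counts the mistakes made on X[i].
    """
    n = len(X)
    # Precompute the Gram matrix so each kernel value is evaluated once.
    gram = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * labels[j] * gram[j][i] for j in range(n))
            if labels[i] * score <= 0:  # misclassified or on the boundary
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training vector is classified correctly
            break
    return alpha

def predict(x, X, labels, alpha, kernel):
    """Classify x by the sign of the kernel-weighted vote of the training set."""
    score = sum(a_j * y_j * kernel(x_j, x)
                for a_j, y_j, x_j in zip(alpha, labels, X))
    return 1 if score > 0 else -1

# Hypothetical term-weight vectors for four documents in two classes.
X = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.8], [0.0, 0.8, 0.9]]
labels = [1, 1, -1, -1]

alpha = train_kernel_perceptron(X, labels, wavelet_kernel)
new_doc = [0.85, 0.15, 0.05]
print(predict(new_doc, X, labels, alpha, wavelet_kernel))
```

With a real SVM implementation, the same `wavelet_kernel` function could be used to build the Gram matrix passed to the solver. The short, low-dimensional vectors here stand in for the reduced vectors that the abstract attributes to term expansion.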

Authors: 张开选 (Zhang Kaixuan), 夏旭 (Xia Xu)

Affiliations: Shandong University Library; Southern Medical University Library

Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; indexed in CSSCI and the Peking University Core Journals list), 2013, No. 9, pp. 1000-1008 (9 pages)

Keywords: electronic document classification, machine learning, support vector machine, L2 space, wavelet analysis

Classification code: G254 [Culture and Science: Library Science]
