Research on Automatic Text Classification Based on a Hybrid Language Model (基于一种混合语言模型的自动文本分类技术研究)

Cited by: 2

Abstract: With the explosive growth of information available on the Internet and on corporate intranets, text classification has become one of the key technologies for processing and organizing large volumes of document data. This paper proposes a hybrid language model that combines ontology with statistical methods to address automatic text classification of Chinese documents. First, a linguistic ontology knowledge base is learned from the training corpus of each category, and each knowledge base serves as the classifier for its category. For a document to be classified, an evaluation value is computed against each category's knowledge base, and the document is assigned to the category with the highest evaluation value. The method is compared with three typical text classifiers: Bayes, k-nearest neighbor, and support vector machine. Preliminary experimental results show that the proposed method outperforms all three.
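The decision rule in the abstract (build one model per category, score the document against each, assign the category with the highest score) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the ontology-based evaluation function, so an add-one-smoothed unigram model stands in for each category's knowledge base, documents are assumed to be pre-tokenized, and the names `train_class_models`, `evaluate`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def train_class_models(corpus_by_class):
    """Build one unigram count model per category.

    Assumption: a unigram model is a simple stand-in for the paper's
    per-category linguistic ontology knowledge base.
    """
    vocab = {tok for docs in corpus_by_class.values() for doc in docs for tok in doc}
    models = {}
    for label, docs in corpus_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc)
        models[label] = (counts, sum(counts.values()), len(vocab))
    return models


def evaluate(doc, model):
    """Average add-one-smoothed log-probability of a tokenized document."""
    counts, total, vocab_size = model
    if not doc:
        return float("-inf")
    # The extra +1 in the denominator reserves probability mass for unseen tokens.
    denom = total + vocab_size + 1
    return sum(math.log((counts[tok] + 1) / denom) for tok in doc) / len(doc)


def classify(doc, models):
    """Score the document under every category model and pick the best one."""
    scores = {label: evaluate(doc, model) for label, model in models.items()}
    return max(scores, key=scores.get), scores


# Toy, made-up training data (not from the paper).
training = {
    "sports": [["match", "goal", "team"], ["team", "win", "goal"]],
    "finance": [["stock", "market", "bank"], ["bank", "rate", "market"]],
}
models = train_class_models(training)
label, scores = classify(["goal", "team", "win"], models)
print(label)  # sports
```

Averaging the log-probability over the document length keeps scores comparable across documents of different sizes; any other per-category evaluation function could replace `evaluate` without changing the argmax step.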

Authors: Zheng Dequan (郑德权), Li Sheng (李生), Zhao Tiejun (赵铁军), Yu Hao (于浩)

Affiliation: MOE-Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology

Source: Journal of Electronics & Information Technology (《电子与信息学报》), 2007, No. 3, pp. 601-605 (5 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

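Funding: Supported by the National Natural Science Foundation of China (60302021) and the Natural Science Foundation of Heilongjiang Province (F2004-04).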

Keywords: text classification; ontology; hybrid language model; context; multi-grams

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
