Research on Automatic Text Classification Methods Based on Concept Hierarchies

Original title: 基于概念层次的英文文本自动分类研究 (automatic classification of English text based on concept hierarchies). Cited by: 3

Abstract: This paper designs and implements an automatic classification and retrieval system for English documents, with a focus on improving classification accuracy. In automatic text classification systems, document content is generally stored as a vector in an N-dimensional feature space, so the feature extraction method and its accuracy strongly affect the correctness of the classification result. Conventional methods are based on word forms and do not examine word meaning: they ignore the diversity and indeterminacy of the word forms that express a single meaning, as well as the relations between word senses, in particular hypernym-hyponym relations. The method proposed here builds on the Vector Space Model (VSM), uses concepts rather than word forms as features, and additionally takes the hypernyms of each word sense into account, so that training can distil more general information from the words and thereby improve classification accuracy.
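The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny lexicon, the hypernym links, the `decay` weighting, and the `max_depth` cutoff are all assumptions made for illustration, standing in for a real resource such as WordNet. It maps word forms to concepts, adds down-weighted hypernym concepts to the vector, and compares documents by cosine similarity.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: word form -> concept (synonyms share one concept).
WORD_TO_CONCEPT = {
    "car": "car", "automobile": "car", "auto": "car",
    "truck": "truck", "lorry": "truck",
    "dog": "dog", "puppy": "dog",
    "cat": "cat",
}

# Hypothetical hypernym links: concept -> its direct hypernym.
HYPERNYM = {
    "car": "motor_vehicle", "truck": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "dog": "animal", "cat": "animal",
}


def word_vector(tokens):
    """Baseline word-form features: raw counts of lower-cased tokens."""
    return Counter(t.lower() for t in tokens)


def concept_vector(tokens, decay=0.5, max_depth=2):
    """Concept features plus hypernyms.

    Each known token contributes weight 1.0 to its concept and
    decay**k to the hypernym k levels above it, up to max_depth levels.
    Tokens missing from the lexicon are skipped (an assumption of this sketch).
    """
    vec = Counter()
    for tok in tokens:
        concept = WORD_TO_CONCEPT.get(tok.lower())
        if concept is None:
            continue
        weight = 1.0
        vec[concept] += weight
        for _ in range(max_depth):
            concept = HYPERNYM.get(concept)
            if concept is None:
                break
            weight *= decay
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    a, b = ["automobile"], ["truck"]
    print("word forms:     ", cosine(word_vector(a), word_vector(b)))
    print("concepts only:  ", cosine(concept_vector(a, max_depth=0),
                                     concept_vector(b, max_depth=0)))
    print("with hypernyms: ", cosine(concept_vector(a), concept_vector(b)))
```

In this toy setting, "automobile" and "truck" share no word form and no concept, so they become comparable only through the shared hypernyms, which is the effect the abstract attributes to considering hypernymy. A real system would also need word-sense disambiguation and a trained classifier on top of these vectors.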

Authors: LI Yuhang (厉宇航), LUO Zhensheng (罗振声), CHENG Musheng (程慕胜)

Affiliation: Computational Linguistics Laboratory, School of Humanities, Tsinghua University

Source: Computer Engineering and Applications (《计算机工程与应用》), indexed in CSCD and the Peking University Core Journals list, 2004, No. 11, pp. 75-77 (3 pages)

Keywords: automatic text classification, concept hierarchy, VSM, WordNet

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
