Text categorization algorithm based on feature order pair quantization (基于特征有序对量化表示的文本分类方法)

Cited by: 4


Abstract: Text categorization should exploit as many of the constraints present in language as possible, yet commonly used document representations ignore the order of the linguistic features that make up a text. This paper uses a clustering-based method to quantize ordered feature pairs quickly, and derives from it a new document representation based on ordered feature pairs that captures the feature order information in a text. Using the vector space centroid method, documents were represented by word pairs and by word-class pairs, and experiments were run on three data sets. Both representations outperform those based on single words or single word classes: the macro-averaged F1 improves by 3%-4% absolute for word pairs and by 5%-7% absolute for word-class pairs (relative improvements of 4%-5% and 8%-10%, respectively). This indicates that feature order information plays an important role in improving text categorization performance.
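The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering-based quantization, so the `cluster_of` mapping below is a hypothetical stand-in (a word-to-cluster lookup), only adjacent pairs are used, and raw pair counts with cosine similarity are assumed for the centroid classifier. All function names are illustrative.

```python
from collections import Counter
import math


def ordered_pairs(tokens, cluster_of=None):
    """Return adjacent ordered feature pairs (a, b), where a precedes b.

    cluster_of is a hypothetical word -> cluster-id mapping standing in for
    the clustering-based quantization; without it, raw word pairs are used.
    """
    if cluster_of is not None:
        tokens = [cluster_of.get(t, t) for t in tokens]
    return list(zip(tokens, tokens[1:]))


def vectorize(tokens, cluster_of=None):
    """Sparse count vector over ordered pairs."""
    return Counter(ordered_pairs(tokens, cluster_of))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def train_centroids(labeled_docs, cluster_of=None):
    """Sum the pair vectors of each class; cosine ignores the overall scale."""
    centroids = {}
    for tokens, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(tokens, cluster_of))
    return centroids


def classify(tokens, centroids, cluster_of=None):
    """Assign the class whose centroid is most similar to the document."""
    vec = vectorize(tokens, cluster_of)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# Two toy documents with identical words but different order.
train = [
    ("dog bites man".split(), "A"),
    ("man bites dog".split(), "B"),
]
centroids = train_centroids(train)
label_a = classify("the dog bites man".split(), centroids)
label_b = classify("a man bites dog".split(), centroids)
```

In this toy setup the two training documents contain the same words, so a bag-of-words centroid could not separate them, whereas the ordered-pair vectors share no features across the two classes.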

Authors: REN Jisheng (任纪生), WANG Zuoying (王作英)

Affiliation: Department of Electronic Engineering, Tsinghua University

Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报（自然科学版）》), 2006, No. 4, pp. 527-529, 533 (4 pages in total). Indexed in: EI, CAS, CSCD, Peking University Core Journals (北大核心).

Funding: National "863" High-Tech Program of China (2001AA114071)

Keywords: text categorization; feature selection; feature abstraction; feature transformation; singular value decomposition

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]



