Text Classification Based on Weight of Feature Words (基于特征词权重的文本分类)

Cited by: 1

Abstract: In text classification, only a few researchers have used the weights of feature words to represent texts as vectors, and the feature selection algorithms they used do not take into account the sign or the range of those weights. On the basis of the CHI statistic, this paper therefore proposes a new method for computing the correlation score between feature words and classes, and selects among different methods for computing the correlation score between texts and classes according to the number of feature words contained in each class's feature set. When determining the category of a text, the method uses only the number of feature words the text contains and their correlation scores with each class, so it can also classify texts that contain few feature words well. Experiments show that the method is effective and feasible.
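The abstract does not give the paper's formula, so the following is only a minimal sketch of the standard CHI statistic for a single term/class pair, together with one common way to attach a sign to it. The function names `chi_square` and `signed_chi` and the sign convention (the sign of `a*d - c*b`) are assumptions made here for illustration; they are not taken from the paper and should not be read as the authors' method.

```python
def chi_square(a, b, c, d):
    """Standard CHI statistic for one term/class pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column of the contingency table is empty,
        # so the statistic is undefined; treat it as no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def signed_chi(a, b, c, d):
    """Hypothetical signed variant (an assumption, not the paper's formula).

    The plain CHI statistic is never negative, so it cannot tell a term
    that indicates the class from one that indicates its absence.
    Here the score is negated when a*d - c*b is negative.
    """
    score = chi_square(a, b, c, d)
    return score if a * d - c * b >= 0 else -score
```

With this convention, a term that appears only inside the class gets a large positive score, a term that appears only outside it gets the same magnitude with a negative sign, and a term distributed evenly across both scores zero.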

Authors: 杨莉 (Yang Li), 万常选 (Wan Changxuan), 雷刚 (Lei Gang), 俞涛 (Yu Tao), 孔保新 (Kong Baoxin)

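Affiliations: School of Information Management, Jiangxi University of Finance and Economics; Jiangxi Provincial University Key Laboratory of Data and Knowledge Engineering, Jiangxi University of Finance and Economics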

Source: Computer and Modernization (《计算机与现代化》), 2012, No. 10, pp. 8-13 (6 pages)

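Funding: National Natural Science Foundation of China (61173146); National Social Science Foundation of China (12CTQ042); Natural Science Foundation of Jiangxi Province (2010GZS0067); Key Science and Technology Project of the Jiangxi Provincial Department of Education (GJJ09650)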

Keywords: text classification; feature selection; correlation score between feature words and classes; correlation score between texts and classes

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
