Gini-Based Fuzzy kNN Classifier (article in English)

Fuzzy kNN Text Classifier Based on Gini Index


Abstract: With the growth of the Web, large numbers of documents are available on the Internet, and automatic text categorization has become a key technique for handling massive data. Among the many text categorization algorithms, kNN has proved to be one of the best. For kNN and most other classifiers, however, text preprocessing is the bottleneck, and its quality directly affects categorization performance. This paper presents a new text preprocessing algorithm based on the Gini index and adopts fuzzy set theory to improve the decision rule of kNN. Combining the two gives a fuzzy kNN classifier that performs better than classical kNN. Experimental results show that the improvement is effective and feasible.
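The abstract names two components, Gini-index-based preprocessing and a fuzzy kNN decision rule, but this record does not give their formulas. The following is therefore only a minimal sketch under stated assumptions, not the authors' method: the term score is assumed to be the purity sum over classes of P(c | term)^2, and the decision rule is assumed to be the widely used inverse-distance fuzzy membership with fuzzifier `m`. All function names (`gini_term_scores`, `vectorize`, `fuzzy_knn`) and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def gini_term_scores(docs, labels):
    """Score each term by sum_c P(c | term)^2 (an ASSUMED Gini-style purity)."""
    # term -> class -> number of documents of that class containing the term
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            counts[term][label] += 1
    scores = {}
    for term, by_class in counts.items():
        total = sum(by_class.values())
        scores[term] = sum((n / total) ** 2 for n in by_class.values())
    return scores


def vectorize(tokens, weights):
    """Build a sparse vector: term frequency times the term's score."""
    vec = defaultdict(float)
    for term in tokens:
        if term in weights:
            vec[term] += weights[term]
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fuzzy_knn(query, train_vecs, labels, k=3, m=2.0):
    """Return class memberships in [0, 1] that sum to 1 over the k neighbours."""
    ranked = sorted(
        ((cosine(query, v), y) for v, y in zip(train_vecs, labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, label in ranked:
        dist = max(0.0, 1.0 - sim)  # ASSUMPTION: distance = 1 - cosine
        # Inverse-distance weight 1 / d^(2/(m-1)); epsilon avoids division by zero
        votes[label] += 1.0 / (dist ** (2.0 / (m - 1.0)) + 1e-9)
    total = sum(votes.values())
    return {label: v / total for label, v in votes.items()}


# Toy corpus (hypothetical data, for illustration only)
docs = [
    ["ball", "goal", "team", "news"],
    ["team", "match", "goal"],
    ["stock", "market", "bank", "news"],
    ["bank", "market", "rate"],
]
labels = ["sport", "sport", "finance", "finance"]

scores = gini_term_scores(docs, labels)
train_vecs = [vectorize(d, scores) for d in docs]
memberships = fuzzy_knn(vectorize(["goal", "team", "news"], scores), train_vecs, labels)
predicted = max(memberships, key=memberships.get)
```

Unlike a hard majority vote, the returned memberships indicate how strongly the query belongs to each class, so a borderline document is visible as one whose top two memberships are close.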

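Authors: Shang Wenqian (尚文倩), Qu Youli (瞿有利), Huang Houkuan (黄厚宽), Zhu Haibin (朱海滨), Lin Yongmin (林永民), Dong Hongbin (董红斌)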

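Affiliations: School of Computer Science, Beijing Jiaotong University (北京交通大学计算机学院); Department of Computer Science and Mathematics, Nipissing University (尼普森大学计算科学与数学系)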

Source: Journal of Guangxi Normal University: Natural Science Edition (《广西师范大学学报（自然科学版）》), indexed in CAS and the PKU Core list, 2006, Issue 4, pp. 87-90 (4 pages)

Funding: National Natural Science Foundation of China (60503017); Beijing Jiaotong University Science Foundation (2004RC008)

Keywords: text categorization; kNN; fuzzy kNN; text preprocessing; Gini index

Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

