
Network-based Noise Revision Algorithm in Text Categorization
(original title: 文本分类中粗分类数据噪声修正的网络算法)
Abstract: In practical text categorization, classifiers are often trained on coarsely categorized data. Such corpora frequently contain documents assigned to the wrong category (noise texts), and using them directly degrades the accuracy of the categorization results. To address this problem, this paper proposes a network-based noise revision algorithm. First, a document-similarity network (DSN) is constructed. The category labels on the documents are then treated as a community structure that partitions the network, and modularity is used to measure the quality of that structure. By optimizing modularity, noise texts are moved to more suitable categories, which improves the quality of the data. Experimental results indicate that the proposed algorithm revises the noise in coarsely categorized data effectively and robustly. The algorithm can be used to preprocess training data for text categorization, or as a supporting technique in tasks such as building literature databases and taxonomies.
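Source: Journal of the China Society for Scientific and Technical Information (《情报学报》; CSSCI, PKU Core), 2008, No. 5, pp. 670-676 (7 pages)
Funding: Key Program of the National Natural Science Foundation of China (70431001); Major International Cooperation Program of the National Natural Science Foundation of China (70620140115); National Natural Science Foundation of China (70271046, 70301009)
Keywords: noise text revision, modularity optimization, text categorization, community structure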
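
The procedure outlined in the abstract (build a document network, read the category labels as communities, then relabel documents so that modularity rises) can be illustrated with a minimal sketch. The record does not say how similarity is measured, how edges are chosen, or how modularity is optimized, so everything concrete here is an assumption: cosine similarity over whitespace-tokenized term counts, a fixed edge `threshold`, a greedy one-document-at-a-time search, and the hypothetical function names `build_network`, `modularity`, and `revise_labels`. The modularity formula itself is the standard Newman-Girvan definition.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_network(docs, threshold=0.2):
    """Weighted adjacency dict linking documents whose similarity exceeds threshold.

    Assumption: cosine similarity and the threshold value are illustrative
    choices, not taken from the paper.
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    adj = {i: {} for i in range(len(docs))}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = cosine(vecs[i], vecs[j])
            if s > threshold:
                adj[i][j] = adj[j][i] = s
    return adj


def modularity(adj, labels):
    """Newman-Girvan modularity of the partition induced by the labels."""
    two_m = sum(sum(nbrs.values()) for nbrs in adj.values())
    if two_m == 0:
        return 0.0
    internal = Counter()  # edge weight inside each category (ordered pairs)
    degree = Counter()    # total weighted degree of each category
    for i, nbrs in adj.items():
        degree[labels[i]] += sum(nbrs.values())
        for j, w in nbrs.items():
            if labels[i] == labels[j]:
                internal[labels[i]] += w
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


def revise_labels(adj, labels, max_rounds=10):
    """Greedy search (an assumption): move one document at a time to the
    existing category that raises modularity the most, until nothing improves."""
    labels = list(labels)
    categories = sorted(set(labels))
    for _ in range(max_rounds):
        moved = False
        for i in adj:
            best_label, best_q = labels[i], modularity(adj, labels)
            for c in categories:
                if c == labels[i]:
                    continue
                trial = labels[:i] + [c] + labels[i + 1:]
                q = modularity(adj, trial)
                if q > best_q + 1e-12:
                    best_label, best_q = c, q
            if best_label != labels[i]:
                labels[i] = best_label
                moved = True
        if not moved:
            break
    return labels


# Toy corpus (invented for illustration): the last document is about sports
# but carries the wrong label "finance".
docs = [
    "stock market shares trading price",
    "market price shares investors stock",
    "stock trading investors market",
    "football match goal team league",
    "team league football players goal",
    "goal match players football team",
]
noisy = ["finance", "finance", "finance", "sports", "sports", "finance"]
network = build_network(docs)
revised = revise_labels(network, noisy)
print(revised)
```

Recomputing modularity from scratch for every trial move keeps the sketch short but costs time proportional to the number of edges per trial; a practical implementation would update the score incrementally.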


