
A Neighbor-Word Feature Selection Method for Imbalanced Data (cited by: 2)

Neighbor Words Selection Algorithm on Imbalanced Data
Abstract: On imbalanced data, traditional feature selection methods such as Information Gain (IG) and Correlation Coefficient (CC) perform worse because they either ignore the contribution of negative features to classification or do not explicitly balance the proportion of positive and negative features. This paper proposes a new feature selection method, Positive-Negative feature selection (PN), which is used to select neighbor words and thereby extract terms from text automatically. Compared with CC, PN takes negative features into account, which are valuable on imbalanced data. Compared with IG, PN balances positive and negative features separately and explicitly, using the share of training documents containing feature t that are positive (or negative) documents. The importance of a positive (or negative) feature t is measured by the ratio of the number of distinct domain (or non-domain) concepts that follow t to the total number of domain (or non-domain) concepts, which addresses IG's bias toward positive features. Experimental results show that PN outperforms both IG and CC.
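Authors: Sun Xia, Zheng Qinghua
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2008, No. 12, pp. 2334-2338 (5 pages)
Funding: Supported by the National Natural Science Foundation of China (60473136) and the Doctoral Program Fund (20040698028)
Keywords: feature selection; imbalanced data; term extraction; neighbor word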
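
The abstract names two quantities for a candidate neighbor word t: the share of positive (or negative) documents among the training documents that contain t, and the share of distinct domain (or non-domain) concepts that follow t. The Python sketch below computes only these two ratios. It is not the PN scoring formula, which the abstract does not give. The function names, the input layouts, and the reading of "total concepts" as the number of distinct concepts are all assumptions made for illustration.

```python
def doc_proportions(docs, t):
    """Share of positive and negative documents among documents containing t.

    docs: list of (set_of_words, is_positive) pairs (hypothetical layout).
    Returns (positive_share, negative_share); (0.0, 0.0) if t never occurs.
    """
    pos = sum(1 for words, is_pos in docs if t in words and is_pos)
    neg = sum(1 for words, is_pos in docs if t in words and not is_pos)
    total = pos + neg
    if total == 0:
        return 0.0, 0.0
    return pos / total, neg / total


def concept_coverage(pairs, t):
    """Share of distinct domain and non-domain concepts that follow t.

    pairs: list of (neighbor_word, concept, is_domain) triples
    (hypothetical layout). "Total concepts" is assumed to mean the number
    of distinct concepts of each kind seen in the data.
    """
    domain_all = {c for _, c, d in pairs if d}
    nondomain_all = {c for _, c, d in pairs if not d}
    domain_t = {c for w, c, d in pairs if w == t and d}
    nondomain_t = {c for w, c, d in pairs if w == t and not d}
    dom = len(domain_t) / len(domain_all) if domain_all else 0.0
    non = len(nondomain_t) / len(nondomain_all) if nondomain_all else 0.0
    return dom, non


if __name__ == "__main__":
    # Toy data, invented purely to exercise the two functions.
    docs = [({"the", "of"}, True), ({"of"}, False),
            ({"of", "a"}, False), ({"a"}, True)]
    pairs = [("of", "kernel", True), ("of", "thread", True),
             ("in", "kernel", True), ("in", "process", True),
             ("of", "weather", False), ("in", "lunch", False)]
    print(doc_proportions(docs, "of"))
    print(concept_coverage(pairs, "of"))
```

Keeping the positive and negative ratios as separate return values mirrors the abstract's point that the two kinds of features are considered independently. How they are combined into a single score is left open here.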
