
Research on a Feature Selection Method Combining CHI and IG
(混合CHI与IG的特征选择方法研究)
Cited by: 1
Abstract  With the rapid development of information technology and the growing number of Internet users, the volume of Internet data is increasing daily, and much of it is unstructured text; text classification has therefore become a hot research topic. The quality of feature selection directly affects the accuracy of text classification. Traditional single feature selection methods differ in emphasis, so the feature subsets they select may vary widely, which leads to unstable classification results. This paper proposes a feature selection method combining CHI and IG. A fused-feature metric, SOM (Score of Mixed), is introduced; features are ranked by their SOM values and filtered with a predetermined threshold to obtain a relatively stable and representative feature subset. Experimental results show that text classification using this method achieves a certain improvement over classification using other feature selection methods.
Authors  TANG Kang; WANG Hai-tao; JIANG Ying; CHEN Xing (Yunnan Key Laboratory of Computer Technology Applications, Kunming University of Science and Technology, Kunming 650500, China)
Source  Information Technology (《信息技术》), 2019, No. 2, pp. 53-57
Funding  National Natural Science Foundation of China (Grant No. 61462049)
Keywords  feature selection; chi-square statistics; information gain; hybrid method
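The abstract's selection procedure can be sketched as follows. The paper does not give the exact SOM combination formula here, so this sketch assumes SOM is a min-max-normalized weighted sum of each term's chi-square and information-gain scores; the weight `alpha`, the threshold of 0.5, and the toy document-term data are illustrative assumptions, not the authors' settings.

```python
import math
from collections import Counter

def chi_square(feature, labels):
    """Chi-square statistic from the 2x2 term-presence x class contingency table."""
    classes = sorted(set(labels))
    n = len(labels)
    obs = Counter(zip(feature, labels))
    chi2 = 0.0
    for f in (0, 1):
        for c in classes:
            row = sum(obs[(f, cc)] for cc in classes)
            col = sum(obs[(ff, c)] for ff in (0, 1))
            expected = row * col / n
            if expected > 0:
                chi2 += (obs[(f, c)] - expected) ** 2 / expected
    return chi2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(feature, labels):
    """IG(t) = H(C) - H(C | t present/absent)."""
    gain = entropy(labels)
    for v in (0, 1):
        subset = [c for f, c in zip(feature, labels) if f == v]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def som_scores(X, labels, alpha=0.5):
    """Assumed form: SOM(t) = alpha * norm(CHI(t)) + (1 - alpha) * norm(IG(t))."""
    chi = [chi_square(col, labels) for col in zip(*X)]
    ig = [information_gain(col, labels) for col in zip(*X)]
    def norm(v):
        lo, hi = min(v), max(v)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in v]
    return [alpha * c + (1 - alpha) * g for c, g in zip(norm(chi), norm(ig))]

# Toy binary document-term matrix (rows: documents, columns: terms) and labels.
X = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
]
y = [0, 0, 1, 1]
scores = som_scores(X, y)
# Rank by SOM and keep terms above a predetermined threshold (0.5 here).
selected = [i for i, s in enumerate(scores) if s >= 0.5]
print(selected)  # terms 0 and 1 separate the classes; term 2 is uninformative
```

In this toy data, terms 0 and 1 perfectly separate the two classes (maximal CHI and IG), while term 2 is independent of the class, so only the first two terms pass the threshold.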
