
Key Feature Selection Method for Weibo Information Based on BIG-WFCHI
Abstract: Feature selection, whose premise is feature extraction, is a key step in improving the accuracy and efficiency of retweet prediction with machine learning methods. The approaches commonly adopted in feature selection include Information Gain (IG), mutual information, and the CHI-square test (CHI). In traditional feature selection methods, problems of IG and CHI such as negative correlation and interference in the calculation caused by low-frequency words lead to low classification accuracy. To address these problems, this study introduces a balance factor and a word frequency factor to increase the accuracy of the two algorithms. Then, according to the spread characteristics of Weibo information and combining the improved IG and CHI algorithms, we propose a feature selection method based on the Balance Information Gain-Word Frequency CHI-square test (BIG-WFCHI). Furthermore, we experimentally test the proposed method with five classifiers, namely a maximum entropy model, support vector machine, naive Bayes classifier, K-nearest neighbor, and multi-layer perceptron, on two heterogeneous data sets. The results show that our method can effectively eliminate both irrelevant and redundant features, increase classification accuracy, and reduce running time.
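The abstract describes BIG-WFCHI only at a high level: a balance factor added to IG, a word frequency factor added to CHI, and a merge of the two improved scores. Since the exact formulas are not given here, the Python sketch below only illustrates the general shape of such a feature-scoring step using the standard IG and CHI statistics for a binary retweet label; the term-frequency weighting, the `alpha` mixing weight, and the function names are placeholder assumptions, not the authors' definitions.

```python
# Minimal sketch of IG/CHI-style term scoring for binary retweet prediction.
# The balance factor, word frequency factor, and the BIG-WFCHI combination
# rule are NOT specified in the abstract; the weighting below is illustrative.
import numpy as np

def information_gain(X, y):
    """Standard IG between term presence/absence and a binary (0/1) label."""
    X = (X > 0).astype(int)
    n = len(y)

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    h_c = entropy(np.bincount(y, minlength=2) / n)   # class entropy H(C)
    ig = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        cond = 0.0
        for v in (0, 1):                              # term absent / present
            mask = X[:, j] == v
            if mask.any():
                p_cv = np.bincount(y[mask], minlength=2) / mask.sum()
                cond += mask.mean() * entropy(p_cv)   # weighted H(C | t = v)
        ig[j] = h_c - cond
    return ig

def chi_square(X, y):
    """Standard CHI-square statistic of each term w.r.t. the positive class."""
    X = (X > 0).astype(int)
    n = len(y)
    chi = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        t = X[:, j]
        a = np.sum((t == 1) & (y == 1))   # term present, retweeted
        b = np.sum((t == 1) & (y == 0))   # term present, not retweeted
        c = np.sum((t == 0) & (y == 1))
        d = np.sum((t == 0) & (y == 0))
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        chi[j] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return chi

def combined_scores(X, y, alpha=0.5):
    """Hypothetical merge of a frequency-weighted CHI score with IG.
    `alpha` and the term-frequency weighting are assumptions for illustration."""
    tf = X.sum(axis=0) / X.sum()          # crude word-frequency factor (assumed)
    norm = lambda s: s / s.max() if s.max() > 0 else s
    return alpha * norm(information_gain(X, y)) + (1 - alpha) * norm(chi_square(X, y) * tf)
```

Given a document-term count matrix `X` (documents by terms) and 0/1 retweet labels `y` as NumPy arrays, one would keep the top-k terms ranked by `combined_scores(X, y)` and feed the reduced matrix to any of the five classifiers listed above; the paper's actual balance and word frequency factors would replace the placeholders used here.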
Authors: YIN Shi-Gang, AN Yang, CAI Xin-Hua, QU Xiao-E (Department of Information Management, Xi’an University of Technology, Xi’an 710048, China; School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China)
Source: Computer Systems & Applications (《计算机系统应用》), 2021, No. 2, pp. 188-193 (6 pages)
Funding: National Natural Science Foundation of China (61672027).
Keywords: Weibo information; feature selection; machine learning; Information Gain (IG); CHI-square test (CHI)
