Abstract
With the rapid development of information technology and the growing number of Internet users, the volume of Internet data is increasing day by day, and much of it is unstructured text. Text classification has therefore become an active research topic. The quality of feature selection directly affects the accuracy of text classification. Traditional single feature selection methods each emphasize different aspects, so the feature subsets they select may differ considerably, which in turn leads to unstable classification results. This paper proposes a feature selection method that combines the chi-square statistic (CHI) and information gain (IG). It introduces SOM (Score of Mixed), an index that fuses the two feature scores; features are ranked by their SOM values and filtered by a predetermined threshold to obtain a relatively stable and representative feature subset. Experimental results show that text classification with features selected by this method improves to some extent over classification using other feature selection methods.
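The pipeline described in the abstract (score each term with CHI and IG, fuse the two scores into SOM, rank, then filter by a threshold) can be sketched as follows. The abstract does not give the SOM formula, so this sketch assumes one plausible form: a weighted sum of min-max normalized CHI and IG scores. The names `score_terms`, `select_features`, `alpha`, and `threshold`, along with the toy corpus, are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: docs with the term in the class, n10: with the term, other classes,
    n01: without the term in the class, n00: without the term, other classes.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def score_terms(docs, labels):
    """Return {term: (chi, ig)} using binary term presence per document."""
    classes = sorted(set(labels))
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    class_counts = Counter(labels)
    h_c = entropy([class_counts[c] for c in classes])
    scores = {}
    for term in vocab:
        with_t = Counter(l for s, l in zip(doc_sets, labels) if term in s)
        n_t = sum(with_t.values())
        # IG = H(C) - P(t) * H(C | t) - P(not t) * H(C | not t)
        h_t = entropy([with_t[c] for c in classes])
        h_not = entropy([class_counts[c] - with_t[c] for c in classes])
        ig = h_c - n_t / n * h_t - (n - n_t) / n * h_not
        # CHI: take the maximum over classes (one common convention).
        chi = max(
            chi_square(
                with_t[c],
                n_t - with_t[c],
                class_counts[c] - with_t[c],
                n - n_t - class_counts[c] + with_t[c],
            )
            for c in classes
        )
        scores[term] = (chi, ig)
    return scores


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def select_features(docs, labels, alpha=0.5, threshold=0.5):
    """Rank terms by an ASSUMED SOM and keep those at or above threshold.

    Assumption: SOM = alpha * CHI_norm + (1 - alpha) * IG_norm.
    The paper's actual SOM definition may differ.
    """
    scores = score_terms(docs, labels)
    terms = sorted(scores)
    chi_n = min_max([scores[t][0] for t in terms])
    ig_n = min_max([scores[t][1] for t in terms])
    som = {t: alpha * c + (1 - alpha) * g for t, c, g in zip(terms, chi_n, ig_n)}
    ranked = sorted(som.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(t, s) for t, s in ranked if s >= threshold]


# Toy corpus (hypothetical): two "sport" and two "finance" documents.
DOCS = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "team"],
    ["market", "bank", "stock"],
]
LABELS = ["sport", "sport", "finance", "finance"]

if __name__ == "__main__":
    for term, som in select_features(DOCS, LABELS):
        print(term, round(som, 3))
```

On this toy corpus, the terms that appear in every document of one class and in no document of the other ("goal", "market", "stock") rank highest under both criteria, so they pass the threshold. Normalizing before mixing keeps either score from dominating purely because of its scale; whether the paper uses normalization, a weighted sum, or another fusion rule is not stated in the abstract.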
Authors
唐康
汪海涛
姜瑛
陈星
TANG Kang; WANG Hai-tao; JIANG Ying; CHEN Xing (Yunnan Key Laboratory of Computer Technology Applications, Kunming University of Science and Technology, Kunming 650500, China)
Source
《信息技术》
2019, No. 2, pp. 53-57 (5 pages)
Information Technology
Funding
Supported by the National Natural Science Foundation of China (Grant No. 61462049)
Keywords
feature selection
chi-square statistic
information gain
hybrid method