Abstract
Text classification, an effective data mining method, has attracted growing attention. The classification process is complex and involves many key techniques, among which feature selection plays an important role; CHI (the chi-square statistic) is a commonly used text feature selection method. To address the shortcomings of the CHI model, this paper optimizes it step by step, taking into account the term frequency of each feature and whether the feature is positively or negatively correlated with a class, so that both kinds of information are used effectively. Subsequent text classification experiments demonstrate the feasibility of the improved CHI text feature selection method.
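For orientation, the sketch below computes the standard (unimproved) CHI statistic for one term and one class from a 2x2 table of document counts. It is not the improved method proposed in the paper, whose formulas are not given in this abstract. The function names and the count names `a`, `b`, `c`, `d` are assumptions chosen for illustration.

```python
def chi_square(a, b, c, d):
    """Standard CHI statistic for a term and a class (illustrative sketch).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def correlation_sign(a, b, c, d):
    """Return 1 for positive, -1 for negative, 0 for no correlation."""
    diff = a * d - b * c
    return (diff > 0) - (diff < 0)


# A term concentrated inside the class and a term concentrated outside it
# receive the same CHI score; only the sign of (a*d - b*c) tells them apart.
positive = chi_square(10, 0, 0, 10)
negative = chi_square(0, 10, 10, 0)
```

The counts are document counts, so the statistic ignores how often a term occurs within a document, and squaring `a*d - b*c` discards the direction of the correlation. These are the two kinds of information the abstract says the improved method puts to use.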
Author
Lin Zhijian (林智健), College of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China
Source
Information & Computer (《信息与电脑》)
2018, No. 7, pp. 172-176 (5 pages)
Keywords
text classification
data preprocessing
feature selection