Abstract
Feature weighting schemes based on category information do not express the relationship between features and categories accurately enough: features with the same category frequency cannot be compared in terms of their ability to discriminate between categories, so the distribution of a feature within each category must also be considered. This paper introduces the inverse category frequency (ICF) of a feature, combined with its inner-category entropy, into term weight calculation and constructs two supervised feature weighting schemes. Experimental results on a Uygur text classification corpus show that the method clearly improves the spatial distribution of the samples and raises the micro-averaged F1 score of Uygur text classification.
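The two quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything here is an assumption: ICF is taken as log(|C| / cf), a common definition; inner-category entropy is taken as the Shannon entropy of a term's occurrence counts over the documents of one category; natural logarithms are used; and the function names and toy corpus are hypothetical. How the paper combines the two into its two weighting schemes is not stated here, so no combined weight is shown.

```python
import math


def icf(term, docs_by_class):
    """Inverse category frequency, assumed here to be log(|C| / cf),
    where cf is the number of categories containing the term."""
    n_classes = len(docs_by_class)
    cf = sum(
        1 for docs in docs_by_class.values()
        if any(term in doc for doc in docs)
    )
    return math.log(n_classes / cf) if cf else 0.0


def inner_class_entropy(term, docs):
    """Shannon entropy (natural log) of the term's occurrence counts
    across the documents of a single category (assumed definition)."""
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical toy corpus: each document is a list of tokens.
docs_by_class = {
    "sports": [["ball", "goal", "ball"], ["ball", "team"], ["goal", "team"]],
    "finance": [["bank", "team"], ["bank", "rate"]],
}

# "ball" and "goal" each occur in exactly one category, so their ICF is equal.
icf_ball = icf("ball", docs_by_class)
icf_goal = icf("goal", docs_by_class)

# Their distributions inside the "sports" category differ, so the entropy does too.
h_ball = inner_class_entropy("ball", docs_by_class["sports"])
h_goal = inner_class_entropy("goal", docs_by_class["sports"])
```

In this toy corpus the two terms have the same category frequency and therefore the same ICF, while their within-category entropies differ. That is the situation the abstract describes, where category frequency alone cannot separate the features.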
Authors
阿力木江·艾沙
殷晓雨
库尔班·吾布力
李喆
Alimjan Aysa; Yin Xiaoyu; Kurban Ubul; Li Zhe (Network & Information Technology Center, Xinjiang University, Urumqi 830046, China; School of Information Science & Engineering, Xinjiang University, Urumqi 830046, China)
Source
《计算机应用研究》
CSCD
Peking University Core Journal
2019, Issue 11, pp. 3237-3239, 3285 (4 pages in total)
Application Research of Computers
Funding
Supported by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2016D01C068)
Keywords
text classification
text feature
term weighting
category frequency