Abstract
Feature weighting is central to text representation, and the quality of the weighting method directly affects the accuracy of text classification and clustering. Weighting methods based on word form and term-frequency statistics are too approximate and coarse: they fail to give prominence to important features with strong category-discriminating power, and they have difficulty separating important features from unimportant ones. The result is a high-dimensional, sparse representation and unsatisfactory classification performance, which is the main obstacle in feature weighting. This paper proposes a feature weighting method based on concept hierarchy. It maps the word space to a concept space and, at the concept level, introduces two parameters, feature support and categorical intensity, to adjust the feature weights. Experiments indicate that the new method achieves better classification performance than the traditional one and also clearly improves on vector space dimensionality and computational efficiency.
Source
《安徽工业大学学报(自然科学版)》
CAS
2008, No. 3, pp. 329-333 (5 pages)
Journal of Anhui University of Technology (Natural Science)
Funding
Key Project of the Natural Science Foundation of the Anhui Provincial Department of Education (2007kJ051A)
Keywords
concept space
feature weighting
concept hierarchy
feature support
categorical intensity