Abstract
The quality of document vectorization strongly affects the speed and accuracy of text categorization. This paper analyzes three formulas commonly used in document vectorization: TF-IDF, mutual information (MI), and information gain (IG). It proposes a feature selection method based on word frequency differences, together with a modified TF-IDF formula, to improve the quality of feature selection and the speed and accuracy of text categorization.
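For orientation, the baseline TF-IDF weighting that the abstract refers to can be sketched as follows. This is a minimal illustration of the common textbook form (relative term frequency multiplied by the logarithm of inverse document frequency), not the modified formula proposed in the paper, which the abstract does not state. The function name `tf_idf` and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook form (an assumption, not the paper's modified formula):
    weight = (count of term in doc / doc length) * ln(N / df(term)).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["stock", "market", "stock"],
    ["football", "match", "market"],
    ["football", "goal"],
]
vectors = tf_idf(docs)
```

In this form a term that occurs in every document receives weight zero, and a term concentrated in a few documents receives a higher weight, which is the behavior that feature selection for categorization relies on.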
Source
《计算机应用》
CSCD
PKU Core (北大核心)
2005, Issue 9, pp. 2031-2033 (3 pages)
Journal of Computer Applications