Abstract
With the rapid growth in the number of Chinese electronic publications and Web documents, automatic Chinese text categorization has become increasingly important. This paper improves an automatic Chinese text categorization method based on association rule mining. Each document is treated as a transaction and each keyword as an item. A feature weight threshold is introduced during text preprocessing, and an improved CDD (Class Differentiate Degree) algorithm is applied when the constructed classifier categorizes unknown documents. Experimental results show that the algorithm obtains understandable rules relatively quickly and achieves good macro-averaged and micro-averaged values.
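The document-as-transaction idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: relative term frequency stands in for the feature weight, the function names (`to_transaction`, `mine_class_rules`, `classify`) and all threshold values are hypothetical, the tokens are assumed to be already word-segmented, and rule ranking by confidence replaces the improved CDD step, which the abstract does not define.

```python
from collections import Counter
from itertools import combinations


def to_transaction(tokens, weight_threshold):
    """Turn a tokenized document into a transaction (a set of keyword items).

    Assumption: the feature weight is the keyword's relative frequency in
    the document; keywords below the threshold are discarded.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return frozenset(t for t, c in counts.items() if c / total >= weight_threshold)


def mine_class_rules(transactions, labels, min_support, min_confidence, max_len=2):
    """Mine rules of the form itemset -> class by brute-force enumeration."""
    n = len(transactions)
    itemset_counts = Counter()   # documents containing the itemset
    joint_counts = Counter()     # documents containing the itemset AND in the class
    for items, label in zip(transactions, labels):
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(items), k):
                itemset_counts[combo] += 1
                joint_counts[(combo, label)] += 1
    rules = []
    for (combo, label), joint in joint_counts.items():
        support = joint / n
        confidence = joint / itemset_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, label, support, confidence))
    # Highest confidence first, then highest support, then itemset order.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0]))
    return rules


def classify(items, rules, default=None):
    """Return the class of the first (best-ranked) rule whose itemset matches."""
    for combo, label, _, _ in rules:
        if set(combo) <= items:
            return label
    return default


# Toy corpus: English tokens stand in for segmented Chinese keywords.
docs = [
    (["stock", "market", "stock", "price", "the"], "finance"),
    (["stock", "bank", "bank", "market"], "finance"),
    (["match", "team", "team", "goal", "the"], "sports"),
    (["team", "match", "match", "score"], "sports"),
]
transactions = [to_transaction(tokens, 0.25) for tokens, _ in docs]
labels = [label for _, label in docs]
rules = mine_class_rules(transactions, labels, min_support=0.5, min_confidence=0.8)

unknown = to_transaction(["stock", "stock", "rises"], 0.25)
print(classify(unknown, rules, default="unknown"))
```

Each mined rule reads directly as "documents containing these keywords belong to this class", which is the sense in which such a classifier is understandable. Raising the weight threshold shrinks each transaction and so reduces the number of candidate itemsets to enumerate.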
Source
《郑州大学学报(理学版)》
CAS
2007, No. 2, pp. 114-117 (4 pages)
Journal of Zhengzhou University:Natural Science Edition
Funding
Supported by the Natural Science Foundation of the Chongqing Science and Technology Commission
Grant No. CSTC2006BB2021
Keywords
association rule mining
Chinese text
automatic text categorization algorithm