Abstract
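Features are extracted from documents using an information gain function, and each document is represented as a vector according to the number of times each feature occurs in it. The features of a document are assumed to be mutually independent, and the joint probability distribution of features and topic categories is assumed to be multinomial. The parameters of this joint distribution are learned from the labeled documents in the training set, and the learned parameters are then used to classify the unlabeled documents in the test set. Experimental results show that the classification achieves high accuracy.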
Document attributes are extracted using an information gain function. Depending on how many times an attribute occurs in a document, the document is represented as a vector consisting of 1s and 0s. The attributes are assumed to be mutually conditionally independent, and the probability distribution of the attributes across the categories is assumed to be multinomial. The parameters of the multinomial distribution are learned from the documents in the training set. Based on the learned parameters and Bayesian theory, the documents in the test set are classified.
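The pipeline described in the abstract (information gain feature selection followed by a multinomial model under the independence assumption) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so the details here are assumptions: information gain is computed on the binary event "term occurs in the document", add-one (Laplace) smoothing is used for the multinomial parameters, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(label_list):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(label_list)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(label_list).values())


def information_gain(docs, labels, term):
    """Information gain of 'term occurs in the document' with respect to the class."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - (len(with_t) / n) * entropy(with_t)
            - (len(without_t) / n) * entropy(without_t))


def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain (ties: alphabetical)."""
    vocab = sorted({t for d in docs for t in d})
    return sorted(vocab, key=lambda t: -information_gain(docs, labels, t))[:k]


def train(docs, labels, features):
    """Estimate log priors and smoothed multinomial log likelihoods per class."""
    feats = set(features)
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        term_counts[y].update(t for t in d if t in feats)
    n = len(labels)
    log_prior = {c: math.log(class_docs[c] / n) for c in class_docs}
    log_like = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # Add-one smoothing is an assumption, not taken from the paper.
        log_like[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(feats)))
                       for t in feats}
    return log_prior, log_like


def classify(doc, log_prior, log_like):
    """Pick the class maximizing log P(c) + sum over selected terms of log P(t | c)."""
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
              for c in log_prior}
    return max(scores, key=scores.get)


# Hypothetical toy corpus: each document is a list of tokens.
train_docs = [["power", "grid", "voltage"], ["grid", "power", "load"],
              ["match", "goal", "team"], ["team", "goal", "score"]]
train_labels = ["energy", "energy", "sports", "sports"]

features = select_features(train_docs, train_labels, 4)
log_prior, log_like = train(train_docs, train_labels, features)
predicted = classify(["power", "load", "grid"], log_prior, log_like)
```

Summing log probabilities over every token occurrence means a term that appears several times contributes several times, which is what distinguishes the multinomial model from a purely binary (presence or absence) one.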
Source
《华北电力大学学报(自然科学版)》
CAS
Peking University Core Journals
2003, No. 6, pp. 83-85 (3 pages)
Journal of North China Electric Power University:Natural Science Edition
Funding
Supported by the Young Teachers Fund of North China Electric Power University (060203)
Keywords
Internet
Web
text categorization
multinomial distribution model
data mining
attribute extraction
independence assumption