Abstract
This paper proposes a new text classification model from the perspective of information theory. The model uses the information that a text provides about a category as the basis for classification, which offers a different way of approaching the text classification problem. From a practical standpoint, the model is shown to be related to the traditional naive Bayes model and to the KL-distance-based central vector (centroid) method, and a proof of this relationship is given. Drawing on the basic concepts of generalized information theory, the model is then extended by introducing the concept of feature weight: the classification model can be revised by adjusting the feature weights, which provides a theoretical basis for solving the problem of revising text classification models.
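The abstract does not reproduce the paper's own proof, but the general kind of link it mentions between naive Bayes and a KL-distance classifier can be illustrated with a standard identity. For a document with n in-vocabulary tokens and empirical word distribution p_d, n * KL(p_d || p_c) plus the multinomial log-likelihood of the document under class c equals -n * H(p_d), which does not depend on c. Minimizing the KL distance to a class distribution therefore picks the same class as multinomial naive Bayes with uniform priors. The sketch below is a minimal illustration under those assumptions; the function names, the Laplace smoothing, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train(docs_by_class, vocab, alpha=1.0):
    """Estimate a Laplace-smoothed word distribution for each class (assumed setup)."""
    models = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc if w in vocab)
        total = sum(counts.values()) + alpha * len(vocab)
        models[label] = {w: (counts[w] + alpha) / total for w in vocab}
    return models


def nb_log_likelihood(doc, p_class):
    """Multinomial naive Bayes log-likelihood (uniform class prior, constants dropped)."""
    return sum(math.log(p_class[w]) for w in doc if w in p_class)


def kl_to_class(doc, p_class):
    """KL(p_d || p_c), where p_d is the document's empirical word distribution."""
    counts = Counter(w for w in doc if w in p_class)
    n = sum(counts.values())
    if n == 0:
        return 0.0  # no in-vocabulary tokens: every class is equally distant
    return sum((c / n) * math.log((c / n) / p_class[w]) for w, c in counts.items())


def classify_nb(doc, models):
    return max(models, key=lambda c: nb_log_likelihood(doc, models[c]))


def classify_kl(doc, models):
    return min(models, key=lambda c: kl_to_class(doc, models[c]))


# Toy data, invented purely for illustration.
training = {
    "sports": [["ball", "goal", "team"], ["team", "match", "goal"]],
    "tech": [["code", "chip", "data"], ["data", "code", "model"]],
}
vocab = {w for docs in training.values() for doc in docs for w in doc}
models = train(training, vocab)
doc = ["goal", "team", "team", "data"]
print(classify_nb(doc, models), classify_kl(doc, models))
```

The two decision rules agree on every document because their scores differ only by a term that is constant across classes. Nothing in the sketch models the feature weights introduced by the paper's generalized-information-theory extension.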
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journal (北大核心)
2008, No. 24, pp. 6312-6315 (4 pages)
Funding
National 973 Key Basic Research and Development Program of China (2004CB318109, 2007CB311100)
Keywords
text classification
information theory
generalized information theory
mutual information
information entropy
feature weight