Abstract
Drawing on information theory and a Bayesian semantic model, this paper proposes an effective method for discovering feature words. The method clusters the original feature-word set of the training texts into word clusters, then classifies and labels the test texts. Closed-test results show a text recognition accuracy above 90%, which suggests the algorithm has good application value for internet information processing.
Source
Systems Engineering (《系统工程》)
CSCD
Peking University Core Journals (北大核心)
2004, No. 9, pp. 107-110 (4 pages)
Funding
Guangdong Province Science and Technology Key Project (A1020103)
Keywords
Text Categorization
Feature Extraction
KL Divergence
Normal Distribution
Bayesian Probability