Abstract
This paper proposes a method for classifying and recognizing massive text collections based on an improved TFIDF algorithm. The information entropy between features and the information entropy within features serve as weighting factors for classification, and the nonlinear mapping ability of a neural network is used to compute the weights and to fuzzify the TFIDF algorithm. The aim is to address both inaccurate classification and the classification of massive text collections. In an experiment with five document categories, five documents per category, and three feature terms, the improved TFIDF algorithm was compared with the traditional TFIDF algorithm. The results show that the improved algorithm classifies and recognizes text better, exhibits smaller variance, is more robust to randomly distributed text, and converges faster, indicating good practical value.
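The abstract does not give the weighting formulas, so the following is only a minimal sketch of one possible reading of entropy-weighted TFIDF, not the paper's method. It assumes that the "between" entropy is measured over a term's distribution across classes (low entropy means the term is discriminative) and that the "within" entropy is measured over its distribution across the documents of one class (high entropy means the term is evenly spread in that class). The function names, the smoothed IDF, and the `(1 + factor)` scaling are all illustrative assumptions, and the neural-network fuzzification step is not reproduced.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (natural log) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def normalized_entropy(counts):
    """Entropy scaled to [0, 1]; defined as 0 when there is only one bin."""
    if len(counts) < 2:
        return 0.0
    return entropy(counts) / math.log(len(counts))


def entropy_weighted_tfidf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   -- list of token lists
    labels -- class label of each document
    The weighting scheme is an assumption made for illustration only.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    tfs = [Counter(doc) for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for tf in tfs for term in tf)

    weights = []
    for tf, label, doc in zip(tfs, labels, docs):
        same_class = [other for other, y in zip(tfs, labels) if y == label]
        doc_weights = {}
        for term, count in tf.items():
            # Term counts per class: concentrated in one class -> low entropy.
            per_class = [
                sum(o[term] for o, y in zip(tfs, labels) if y == c)
                for c in classes
            ]
            between = 1.0 - normalized_entropy(per_class)
            # Term counts per document inside this document's class.
            within = normalized_entropy([o[term] for o in same_class])
            idf = math.log(1.0 + n_docs / df[term])  # smoothed IDF (assumed)
            doc_weights[term] = (
                (count / len(doc)) * idf * (1.0 + between) * (1.0 + within)
            )
        weights.append(doc_weights)
    return weights
```

With four toy documents `[["a", "b"], ["a", "c"], ["d", "b"], ["d", "c"]]` labelled `[0, 0, 1, 1]`, the term `a` occurs only in class 0 and in every document of that class, so it receives a larger weight in the first document than `b`, which is split evenly across both classes.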
Source
Bulletin of Science and Technology (《科技通报》)
Indexed in the Peking University Core Journals list (北大核心)
2014, No. 4, pp. 191-193 (3 pages)
Funding
Design and Development of a Campus Network Search Engine (09190107051)
Keywords
improved TFIDF algorithm
text categorization
neural network
feature