Abstract
To address the poor classification performance caused by low-quality text feature extraction in the traditional random forest algorithm, this paper proposes an improved random forest algorithm for large-volume text such as books. Because the quality of the decision trees in a traditional random forest is difficult to guarantee, a weighted voting mechanism is also proposed to improve decision tree quality. The algorithm has two main parts. The first is the Tr-K method for extracting text topic features, which aims to improve the quality and representativeness of those features. The second is a validation mechanism based on the 1/3 of the data left out-of-bag by bootstrap sampling. The experiments use the 20 Newsgroups dataset and the Chinese text classification corpus provided by Sogou Lab. Using both an English and a Chinese dataset tests the generalization of the model, and on both datasets the method shows better classification ability than the traditional random forest algorithm. Experimental results obtained in a Python environment show that the proposed method achieves better text classification results than C4.5, KNN, SVM, and the original random forest algorithm.
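The out-of-bag validation and weighted voting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the weight formula, so the sketch assumes each tree's weight is its accuracy on its own out-of-bag samples, and the function names (bootstrap_split, oob_weight, weighted_vote) are hypothetical. For large samples the out-of-bag share of a bootstrap draw approaches 1/e, about 36.8%, which is the "1/3" the abstract refers to.

```python
import random
from collections import defaultdict


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def oob_weight(predict, X, y, out_of_bag):
    """Accuracy of one tree on its own out-of-bag samples.

    Assumption: this accuracy is used directly as the tree's vote weight.
    """
    if not out_of_bag:
        return 0.0
    hits = sum(predict(X[i]) == y[i] for i in out_of_bag)
    return hits / len(out_of_bag)


def weighted_vote(predictions, weights):
    """Add each tree's weight to the label it predicts; return the top label."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Usage: one accurate tree can outvote two weak trees.
rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(1000, rng)
oob_share = len(out_of_bag) / 1000  # close to 1/e for large n
winner = weighted_vote(["sports", "finance", "finance"], [0.9, 0.3, 0.4])
```

With plain majority voting the example above would return "finance"; with the assumed weights the single tree that validated well on its out-of-bag data decides the result instead.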
Authors
SUN Yan-xiong (孙彦雄)
LI Ye-li (李业丽)
BIAN Yu-ning (边玉宁)
Beijing Institute of Graphic Communication, Beijing 102600, China
Source
Computer Technology and Development (《计算机技术与发展》)
2020, Issue 6, pp. 65-70 (6 pages)
Funding
Beijing Municipal Science and Technology Innovation Service Capability Collaborative Innovation Project (PXM2016_014223_000025).