Abstract
Headline classification assigns a category to a title-like statement, typically a short text of no more than 20 characters that is concise and highly summarizing. To address the feature sparsity and semantic uncertainty of headline text, this paper proposes a headline classification algorithm that combines random forest with the multinomial Bayes model. The algorithm introduces the multinomial Bayes model into the construction of the random forest's base classifiers and uses the forest's out-of-bag (OOB) data to build a voting mechanism based on a two-dimensional weight distribution. Experiments on real library bibliographic data compare its classification performance with that of an SVM algorithm based on LDA topic expansion. The results show that, under certain conditions, the proposed method is stable and performs better.
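The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the details, so the sketch assumes that each base learner is a Laplace-smoothed multinomial naive Bayes model trained on a bootstrap sample and a random subset of terms, and that the "two-dimensional weight" is one weight per (base learner, class) pair, taken here as the smoothed OOB precision of that learner for that class. All function names and parameters (`train_forest`, `feature_frac`, and so on) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(docs, labels, vocab, classes):
    """Multinomial naive Bayes with Laplace smoothing, restricted to `vocab`."""
    prior = Counter(labels)
    counts = {c: Counter() for c in classes}
    for tokens, y in zip(docs, labels):
        counts[y].update(t for t in tokens if t in vocab)
    model = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        log_prior = math.log((prior[c] + 1) / (len(labels) + len(classes)))
        log_like = {t: math.log((counts[c][t] + 1) / total) for t in vocab}
        model[c] = (log_prior, log_like)
    return model

def predict_nb(model, tokens):
    """Return the class with the highest log posterior; unseen terms are ignored."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(sorted(model), key=score)

def train_forest(docs, labels, n_learners=25, feature_frac=0.7, seed=0):
    """Bagged naive Bayes learners, each with OOB-derived per-class weights."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    all_terms = sorted({t for d in docs for t in d})
    n = len(docs)
    forest = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        oob = sorted(set(range(n)) - set(idx))            # out-of-bag documents
        k = max(1, int(feature_frac * len(all_terms)))
        vocab = set(rng.sample(all_terms, k))             # random term subspace
        model = train_nb([docs[i] for i in idx],
                         [labels[i] for i in idx], vocab, classes)
        # Assumed weighting: smoothed OOB precision per (learner, class).
        hits, preds = Counter(), Counter()
        for i in oob:
            p = predict_nb(model, docs[i])
            preds[p] += 1
            hits[p] += (p == labels[i])
        weights = {c: (hits[c] + 1) / (preds[c] + 2) for c in classes}
        forest.append((model, weights))
    return forest

def predict_forest(forest, tokens):
    """Weighted vote: each learner votes for its class with its OOB weight."""
    votes = defaultdict(float)
    for model, weights in forest:
        c = predict_nb(model, tokens)
        votes[c] += weights[c]
    return max(sorted(votes), key=votes.get)
```

A call such as `predict_forest(train_forest(docs, labels), tokens)` expects `docs` as lists of tokens (for Chinese titles, the output of a word segmenter) and returns one of the labels seen in training. Weighting each vote by the learner's OOB record for the predicted class, rather than by a single per-learner score, is one plausible reading of the two-dimensional scheme; the paper itself may define it differently.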
Source
Application Research of Computers (《计算机应用研究》)
CSCD
Peking University Core Journal (北大核心)
2017, No. 4, pp. 1004-1007 (4 pages)
Funding
Supported by the National Social Science Fund of China (15FTQ002)
Keywords
natural language processing
headline classification
ensemble learning
improved random forest
OOB two-dimensional weight distribution