Abstract
[Purpose/significance] Large-scale hierarchical text classification is one of the open difficulties in current text classification research. Because both the volume of data and the number of categories are very large, classifiers rarely reach satisfactory performance. To address this problem, a large-scale hierarchical text classification method based on Stacking ensemble learning is proposed. [Method/process] The method classifies documents top-down and uses two different strategies to train the high-level and low-level classifiers. The high-level classifiers (first and second levels) are trained with a multi-class strategy, and a constraint algorithm driven by the high-level classification results selects the appropriate low-level classifiers. The low-level classifiers are trained with a binary classification strategy: the Stacking algorithm trains base classifiers and a fusion classifier for each low-level category, the fusion classifiers' predictions are ranked, and the label with the highest score is returned as the classification result. [Result/conclusion] Experiments on a Chinese journal literature dataset show that the proposed method effectively improves the performance of large-scale hierarchical text classification.
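The prediction flow described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The hierarchy, the keyword lists, both base scorers, and the fixed fusion weights are all hypothetical stand-ins. A single high level is used instead of the two in the paper. The constraint step is assumed to mean "keep only the children of the predicted high-level class". In actual Stacking, the fusion classifier would be trained on the base classifiers' outputs rather than using fixed weights.

```python
from typing import Callable, Dict, List

# Hypothetical hierarchy: high-level class -> its low-level classes.
HIERARCHY: Dict[str, List[str]] = {
    "computing": ["machine_learning", "databases"],
    "medicine": ["cardiology", "oncology"],
}

# Toy keyword lists standing in for trained models (illustrative only).
KEYWORDS: Dict[str, List[str]] = {
    "computing": ["classifier", "query", "index", "training"],
    "medicine": ["patient", "tumor", "heart"],
    "machine_learning": ["classifier", "training", "ensemble"],
    "databases": ["query", "index", "transaction"],
    "cardiology": ["heart", "artery"],
    "oncology": ["tumor", "chemotherapy"],
}


def coverage_score(tokens: List[str], label: str) -> float:
    """Base scorer 1: fraction of the label's keywords found in the document."""
    words = KEYWORDS[label]
    return sum(w in tokens for w in words) / len(words)


def density_score(tokens: List[str], label: str) -> float:
    """Base scorer 2: fraction of document tokens that are keywords of the label."""
    if not tokens:
        return 0.0
    return sum(t in KEYWORDS[label] for t in tokens) / len(tokens)


BASE_SCORERS: List[Callable[[List[str], str], float]] = [coverage_score, density_score]
# Assumption: fixed weights replace the trained fusion (meta) classifier.
FUSION_WEIGHTS: List[float] = [0.5, 0.5]


def predict_high_level(tokens: List[str]) -> str:
    """Multi-class step: choose one high-level class for the document."""
    return max(HIERARCHY, key=lambda label: coverage_score(tokens, label))


def fused_score(tokens: List[str], label: str) -> float:
    """Binary (one-per-class) step: combine the base scores for one low-level class."""
    return sum(w * scorer(tokens, label) for w, scorer in zip(FUSION_WEIGHTS, BASE_SCORERS))


def classify(text: str) -> str:
    """Top-down prediction: high-level class, candidate selection, then ranking."""
    tokens = text.lower().split()
    parent = predict_high_level(tokens)
    candidates = HIERARCHY[parent]  # assumed constraint: children of the predicted parent
    return max(candidates, key=lambda label: fused_score(tokens, label))
```

Calling `classify("training an ensemble classifier")` routes the document to the `computing` branch and then ranks only `machine_learning` and `databases`. This routing is why top-down approaches scale to many categories: each document is scored against a small candidate set instead of every leaf class.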
Source
《情报理论与实践》
CSSCI
Peking University Core Journals (北大核心)
2020, Issue 10, pp. 171-176, 182 (7 pages in total)
Information Studies: Theory & Application
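Funding
This paper is one of the outcomes of the following projects:
China Knowledge Centre for Engineering Sciences and Technology construction project "Construction of Knowledge Organization Systems" (Project No. CKCEST-2020-1-19)
Institute of Scientific and Technical Information of China key work project "Research on Key Technologies for Multimodal Knowledge Graph Construction" (Project No. ZD2020-09)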
Keywords
Stacking algorithm
text classification
hierarchical classification
deep learning
ensemble learning