摘要
【目的】改善图书和期刊论文等的书目信息的分类性能。【应用背景】采用传统向量空间模型对图书和期刊论文等书目信息分类的效果不理想,通过LDA模型挖掘文本隐含语义信息,能有效提高分类效果。【方法】通过LDA建模,用隐含主题表示文本并通过分类效果确定最优主题数,在此基础上采用SVM算法分类。【结果】实验表明,在复旦和Sogou公开语料库中的Macro_F1分别达到95.5%和93.5%;在馆藏目录及电子期刊数据库等真实书目数据中的Macro_F1分别达到77.4%和87.6%。【结论】在真实数据上的分类性能比传统向量空间模型分别提高10%和3%,达到实用水平。
[Objective] To improve the classification effect of bibliographic information of books and journal articles etc. [Context] The classification performance under the traditional vector space model is not satisfied, and LDA model can effectively improve the classification effect by mining the implied semantic information. [Methods] Using LDA model to represent each text with implied topics, the optimal number of topics is determined on the classification result.Then the SVM classification algorithm is used. [Results] Experiments show that the Macro_F1 in Fudan and Sogou corpus reach 95.5% and 93.5% respectively; the Macro_F1 on the real data from catalogue and electronic journal database reach 77.4% and 87.6% respectively. [Conclusions] The classification performance on real data is increased by 10% and 3% respectively compared to the VSM, that reaches the practical level.
出处
《现代图书情报技术》
CSSCI
北大核心
2014年第5期18-25,共8页
New Technology of Library and Information Service
关键词
LDA模型
文本分类
向量空间模型
GIBBS抽样
SVM
Latent Dirichlet Allocation Text categorization Vector Space Model Gibbs sampling Support Vector Machine