Research and Implementation of a Bibliographic Information Classification System Based on the LDA Model (cited by: 12)
Abstract [Objective] To improve classification performance for the bibliographic information of books, journal articles, and similar materials. [Context] The traditional vector space model (VSM) classifies such bibliographic information poorly; mining the latent semantic information of the text with an LDA model can effectively improve classification. [Methods] Each text is represented by latent topics obtained through LDA modeling, the optimal number of topics is determined from classification performance, and the SVM algorithm is then used for classification. [Results] Experiments show that Macro_F1 reaches 95.5% and 93.5% on the public Fudan and Sogou corpora, respectively, and 77.4% and 87.6% on real bibliographic data from a library catalogue and an electronic journal database, respectively. [Conclusions] On the real data, classification performance is 10% and 3% higher, respectively, than with the traditional vector space model, which reaches a practical level.
Source New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, Peking University Core Journal, 2014, No. 5, pp. 18-25 (8 pages)
Keywords Latent Dirichlet Allocation (LDA) model; text categorization; vector space model; Gibbs sampling; support vector machine (SVM)
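
The pipeline outlined in the abstract (represent each text by LDA topics, pick the number of topics by classification performance, then classify with an SVM) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, whose `LatentDirichletAllocation` is fitted by variational Bayes rather than the Gibbs sampling named in the keywords, and the function name `select_topic_count`, the toy corpus, and the candidate topic counts are all hypothetical. Chinese bibliographic text would also need word segmentation before vectorization, which is omitted here.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def select_topic_count(texts, labels, candidate_topic_counts, seed=0):
    """Pick the LDA topic count that gives the best held-out Macro-F1.

    Hypothetical helper: returns (best_k, best_macro_f1, all_scores).
    """
    x_train, x_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.3, stratify=labels, random_state=seed
    )
    scores = {}
    for k in candidate_topic_counts:
        # Word counts -> per-document topic proportions -> linear SVM.
        # Assumption: variational-Bayes LDA stands in for Gibbs sampling.
        model = make_pipeline(
            CountVectorizer(),
            LatentDirichletAllocation(n_components=k, random_state=seed),
            LinearSVC(),
        )
        model.fit(x_train, y_train)
        # Macro-F1 is the unweighted mean of the per-class F1 scores.
        scores[k] = f1_score(y_val, model.predict(x_val), average="macro")
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k], scores


# Toy stand-in corpus (hypothetical, not data from the paper).
toy_texts = (
    ["library catalogue classification of books",
     "journal index and library records"] * 6
    + ["neural network training with gradients",
       "gradient descent for network weights"] * 6
)
toy_labels = ["library science"] * 12 + ["computer science"] * 12

best_k, best_f1, scores = select_topic_count(toy_texts, toy_labels, [2, 4])
print(best_k, best_f1, scores)
```

Selecting the topic count on a held-out split, rather than on the data used for training, keeps the choice from simply rewarding overfitting; on a real corpus, cross-validation would be a reasonable alternative.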