
Topic Label Extraction Based on Seed Words

Cited by: 7
Abstract: Traditional topic models represent each topic as a probability distribution over words, which limits how interpretable the topics are: a list of words is hard to read as a single, consistent meaning. Building on the output of Latent Dirichlet Allocation (LDA), this paper proposes a topic label extraction method based on seed words. The method first extracts the seed words of each topic using a proposed weighting formula. It then applies a bootstrapping procedure to iteratively generate a set of key phrases that contain the seed words. Finally, it selects the topic label from this set according to the completeness and generality of each phrase. Experiments were run on two corpora: topic-oriented reports from the Two Sessions (两会) and event-based news reports. The displayed results and a human evaluation indicate that the extracted labels express the semantic content of the topics fairly accurately.
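Authors: Kou Wanqiu (寇宛秋), Li Fang (李芳)
Source: Journal of Chinese Information Processing (《中文信息学报》; indexed in CSCD and the Peking University Core Journals list), 2013, No. 5, pp. 114-121, 143 (9 pages)
Funding: National Natural Science Foundation of China (Grant No. 60873134)
Keywords: topic labelling; seed word extraction; bootstrapping method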
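
The three stages named in the abstract (seed word extraction, bootstrapped phrase growth, label selection) can be illustrated with a minimal Python sketch. This record does not give the paper's weighting formula or its definitions of completeness and generality, so everything below is an assumption made for illustration: `seed_words` weights a word by its topic probability times its exclusivity to the topic, `bootstrap_phrases` grows phrases by one adjacent token per round, and `select_label` scores a phrase by how rarely it is swallowed by a longer phrase (completeness) times the fraction of sentences it covers (generality). All function names and the toy data are hypothetical.

```python
from collections import Counter


def seed_words(topic_word_probs, topic, top_n=2):
    """Rank words of `topic` by p(w|topic) * exclusivity (an assumed weight, not the paper's)."""
    scores = {}
    for word, p in topic_word_probs[topic].items():
        total = sum(dist.get(word, 0.0) for dist in topic_word_probs.values())
        scores[word] = p * (p / total)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def count_ngrams(sentences, n):
    """Count contiguous n-grams over tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def bootstrap_phrases(sentences, seeds, max_len=3, min_count=2):
    """Grow phrases from the seeds: each round keeps frequent n-grams that
    extend a phrase from the previous round by one token on either side."""
    phrases = {}
    frontier = {(seed,) for seed in seeds}
    for n in range(2, max_len + 1):
        grown = {
            gram: count
            for gram, count in count_ngrams(sentences, n).items()
            if count >= min_count and (gram[:-1] in frontier or gram[1:] in frontier)
        }
        if not grown:
            break
        phrases.update(grown)
        frontier = set(grown)
    return phrases


def contains(long_phrase, short_phrase):
    """True if short_phrase occurs as a contiguous run inside long_phrase."""
    k = len(short_phrase)
    return any(long_phrase[i:i + k] == short_phrase
               for i in range(len(long_phrase) - k + 1))


def select_label(phrases, num_sentences):
    """Pick the phrase maximizing completeness * generality (both assumed definitions)."""
    def score(phrase):
        longer = [c for p, c in phrases.items()
                  if len(p) > len(phrase) and contains(p, phrase)]
        completeness = 1.0 - max(longer, default=0) / phrases[phrase]
        generality = phrases[phrase] / num_sentences
        return completeness * generality
    return " ".join(max(phrases, key=lambda p: (score(p), p)))


# Toy data standing in for LDA output and a tokenized corpus (hypothetical).
topic_word_probs = {
    0: {"economic": 0.4, "growth": 0.3, "reform": 0.1, "year": 0.05},
    1: {"match": 0.5, "team": 0.4, "economic": 0.05, "year": 0.05},
}
sentences = [
    ["stable", "economic", "growth", "is", "a", "priority"],
    ["policies", "support", "stable", "economic", "growth"],
    ["economic", "growth", "slowed", "last", "year"],
    ["economic", "reform", "continues"],
]

seeds = seed_words(topic_word_probs, topic=0)
phrases = bootstrap_phrases(sentences, seeds)
print(select_label(phrases, len(sentences)))  # prints: stable economic growth
```

On the toy data, the fragment "stable economic" is always part of "stable economic growth", so its assumed completeness is zero and the full phrase is preferred. A real pipeline would take the topic-word distributions from a trained LDA model and would need Chinese word segmentation before counting n-grams.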













