News Topic Segmentation Based on LSA-HMM (基于LSA-HMM的新闻主题分割)


Abstract: Topic segmentation is the basis for fast and effective retrieval and management of news story programs. Traditional topic segmentation based on the Hidden Markov Model (HMM) finds topic boundaries using only the transitions between topics, and does not account for the latent semantic relationships among the words within each topic. This paper proposes an improved HMM-based algorithm. The algorithm applies Latent Semantic Analysis (LSA) to the word-frequency vectors for feature extraction and dimensionality reduction, which captures the contextual relationships among words. During training, document class labels are obtained by K-means clustering. The LSA features and the topic classes serve as the observations and the hidden states of the HMM, respectively, so that the relationships between topics are also taken into account, and the text is finally segmented by topic. Experiments show that the algorithm achieves good segmentation performance.
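The decoding step implied by the abstract can be illustrated with a minimal sketch. It assumes the LSA feature vectors and the K-means cluster centroids are already available, and it is not the paper's implementation: the isotropic Gaussian emission model, the `stay_prob` transition parameter, and the names `log_gaussian`, `viterbi`, and `boundaries` are all hypothetical choices made for illustration. Each hidden state is a topic class, each observation is one low-dimensional feature vector, and a topic boundary is reported wherever the decoded state changes.

```python
import math


def log_gaussian(x, mean, var=1.0):
    """Log density of an isotropic Gaussian (assumed emission model)."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)


def viterbi(observations, centroids, stay_prob=0.8):
    """Most likely topic-state sequence for a list of feature vectors.

    Assumes at least two topic states. Transitions are an assumption:
    stay in the same topic with probability stay_prob, otherwise move
    uniformly to one of the other topics.
    """
    k = len(centroids)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (k - 1))
    log_init = math.log(1.0 / k)

    # Scores for the first observation under a uniform initial distribution.
    scores = [log_init + log_gaussian(observations[0], c) for c in centroids]
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = [], []
        for j, c in enumerate(centroids):
            # Best predecessor state for landing in state j.
            best_i = max(
                range(k),
                key=lambda i: scores[i] + (log_stay if i == j else log_move),
            )
            trans = log_stay if best_i == j else log_move
            new_scores.append(scores[best_i] + trans + log_gaussian(obs, c))
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(k), key=lambda i: scores[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def boundaries(path):
    """Indices where the decoded topic differs from the previous one."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]


# Toy data: two topic centroids and six sentence-level feature vectors.
centroids = [(0.0, 0.0), (5.0, 5.0)]
observations = [
    (0.2, 0.1), (-0.1, 0.3), (0.1, -0.2),
    (5.1, 4.8), (4.9, 5.2), (5.0, 5.0),
]
path = viterbi(observations, centroids)
print(path, boundaries(path))
```

On this toy input the first three vectors lie near the first centroid and the last three near the second, so the decoded path switches state once and a single boundary is reported at index 3. In a fuller pipeline the observations would come from a truncated SVD of the term-frequency matrix, and the transition probabilities would be estimated from labelled training sequences rather than fixed by hand.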

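Author: Shi Qian (史倩)

Affiliation: School of Computer Science, Northwestern Polytechnical University

Source: Computer and Modernization (《计算机与现代化》), 2012, No. 5, pp. 27-30, 34 (5 pages)

Keywords: topic segmentation; hidden Markov model; topic model; latent semantic analysis

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]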


References (14)

1 Beeferman D, Berger A, Lafferty J. Statistical models for text segmentation [J]. Machine Learning, 1999, 34(1-3): 177-210.
2 Manning C D. Rethinking Text Segmentation Models: An Information Extraction Case Study [R]. Technical Report, University of Sydney, 1998.
3 Dharanipragada S, Franz M, McCarley J, et al. Story segmentation and topic detection in the broadcast news domain [C]//Proceedings of the DARPA Broadcast News Workshop. 1999: 1-4.
4 Hearst M A. TextTiling: Segmenting text into multi-paragraph subtopic passages [J]. Computational Linguistics, 1997, 23(1): 33-64.
5 Stokes N, Carthy J, Smeaton A F. SeLeCT: A lexical cohesion based news story segmentation system [J]. Journal of AI Communications, 2004, 17(1): 3-12.
6 Yamron J P, Carp I, Gillick L, et al. A hidden Markov model approach to text segmentation and event tracking [C]//Proceedings of ICASSP. 1998: 333-336.
7 Ponte J M, Croft W B. Text segmentation by topic [C]//Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries. 1997: 120-129.
8 Hofmann T. Unsupervised learning by probabilistic latent semantic analysis [J]. Machine Learning Journal, 2001, 42(1): 177-196.
9 Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation [J]. Journal of Machine Learning Research, 2003, 3(3): 993-1022.
10 Liu Yunzhong, Lin Yaping, Chen Zhiping. Text information extraction based on hidden Markov model [J]. Journal of System Simulation, 2004, 16(3): 507-510.

