Topic segmentation is the basis for fast and effective retrieval and management of news story programs. Traditional topic segmentation techniques based on the Hidden Markov Model (HMM) locate topic boundaries using only the topics and the transitions between them, and do not take into account the latent semantic relationships among the words within each topic. This paper proposes an improved HMM-based algorithm. The algorithm applies Latent Semantic Analysis (LSA) to the word frequency vectors for feature extraction and dimensionality reduction, which captures the contextual relationships among words. During training, class labels for the documents are obtained through K-means clustering. The LSA features and the topic classes then serve as the observations and the hidden states of the HMM, respectively, so that the relationships between topics are also taken into account, and the text is finally segmented by topic. Experiments show that the proposed algorithm achieves good segmentation performance on news documents.
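As a rough illustration of the final decoding step, the sketch below runs Viterbi decoding over a sequence of low-dimensional feature vectors and reports a boundary wherever the decoded topic changes. It is not the paper's implementation: the function names (`gaussian_logp`, `viterbi_segment`), the isotropic Gaussian emission model centred on cluster centroids, the variance, the transition probabilities, and the toy 2-D vectors standing in for LSA features are all assumptions made for the example.

```python
import math


def gaussian_logp(x, centroid, var=0.1):
    """Unnormalised log-likelihood of x under an isotropic Gaussian (assumed emission model)."""
    return -sum((a - b) ** 2 for a, b in zip(x, centroid)) / (2.0 * var)


def viterbi_segment(emission_logp, log_trans, log_init):
    """Return the most likely topic sequence and the indices where the topic changes.

    emission_logp[t][k]: log-likelihood of observation t under topic k
    log_trans[j][k]:     log-probability of moving from topic j to topic k
    log_init[k]:         log-probability of starting in topic k
    """
    n_obs, n_topics = len(emission_logp), len(log_init)
    # score[k] is the best log-probability of any path that ends in topic k.
    score = [log_init[k] + emission_logp[0][k] for k in range(n_topics)]
    back = []  # back[t-1][k] is the best predecessor of topic k at step t
    for t in range(1, n_obs):
        new_score, pointers = [], []
        for k in range(n_topics):
            best_prev = max(range(n_topics), key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][k] + emission_logp[t][k])
        score = new_score
        back.append(pointers)
    # Backtrack from the best final topic.
    state = max(range(n_topics), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    boundaries = [t for t in range(1, n_obs) if path[t] != path[t - 1]]
    return path, boundaries


# Toy data (hypothetical): two topic centroids and five sentence-level feature vectors.
centroids = [[1.0, 0.0], [0.0, 1.0]]
features = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.5], [0.1, 0.9], [0.2, 0.8]]
emissions = [[gaussian_logp(x, c) for c in centroids] for x in features]
log_trans = [[math.log(0.8), math.log(0.2)], [math.log(0.2), math.log(0.8)]]
log_init = [math.log(0.5), math.log(0.5)]

path, boundaries = viterbi_segment(emissions, log_trans, log_init)
print(path, boundaries)  # [0, 0, 0, 1, 1] [3]
```

In this toy run the third vector is ambiguous on its own, but the high self-transition probability keeps it in the first topic, so a single boundary is placed before the fourth vector.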
Computer and Modernization