Abstract
Topic tracking is an important technique in information processing, and extracting robust features from topic samples is a central research problem. This paper proposes an algorithm based on kernel principal component analysis (KPCA) to address topic drift in the samples. The algorithm first builds a weighting matrix from prior topic knowledge in the development set. KPCA is then applied to each sample to compensate for topic drift, which effectively removes the influence of the drift and makes the sample features more robust. Finally, K-nearest neighbor (KNN) and Rocchio classifiers are used to track each topic sample. Topic tracking tests on the Fisher English transcript corpus show that the system reduces the detection cost by 15% to 18% relative to the baseline system.
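The pipeline described in the abstract (kernel feature projection followed by KNN and Rocchio classification) can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the development-set weighting matrix or the exact drift compensation step, so both are omitted here. The RBF kernel, the value of `gamma`, the toy feature vectors, the on-topic/off-topic labels, and all function names are assumptions made for illustration, and Rocchio is taken in its common nearest-centroid form.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Pairwise RBF kernel exp(-gamma * ||a - b||^2) between rows of A and B.
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_kpca(X, n_components, gamma):
    # Eigendecompose the centered training kernel matrix.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)
    top = np.argsort(vals)[::-1][:n_components]
    # Scale by 1/sqrt(eigenvalue) so each feature-space axis has unit norm.
    alphas = vecs[:, top] / np.sqrt(vals[top])
    return {"X": X, "K": K, "alphas": alphas, "gamma": gamma}

def transform_kpca(model, Y):
    # Project new samples onto the leading kernel principal components.
    K = model["K"]
    Ky = rbf_kernel(Y, model["X"], model["gamma"])
    Kyc = Ky - Ky.mean(axis=1, keepdims=True) - K.mean(axis=0)[None, :] + K.mean()
    return Kyc @ model["alphas"]

def knn_predict(train_Z, train_y, test_Z, k):
    # Majority vote among the k nearest training samples (Euclidean distance).
    preds = []
    for z in test_Z:
        dist = np.linalg.norm(train_Z - z, axis=1)
        nearest = train_y[np.argsort(dist)[:k]]
        preds.append(int(np.bincount(nearest).argmax()))
    return preds

def rocchio_predict(train_Z, train_y, test_Z):
    # Assign each sample to the class with the nearest centroid.
    labels = np.unique(train_y)
    centroids = np.stack([train_Z[train_y == c].mean(axis=0) for c in labels])
    dist = np.linalg.norm(test_Z[:, None, :] - centroids[None, :, :], axis=2)
    return [int(labels[i]) for i in dist.argmin(axis=1)]

# Hypothetical term-weight vectors: label 1 = on-topic, 0 = off-topic.
X_train = np.array([[3, 0, 1], [4, 1, 0], [3, 1, 1],
                    [0, 3, 1], [1, 4, 0], [0, 3, 2]], dtype=float)
y_train = np.array([1, 1, 1, 0, 0, 0])
X_test = np.array([[4, 0, 1], [0, 4, 1]], dtype=float)

model = fit_kpca(X_train, n_components=2, gamma=0.1)
Z_train = transform_kpca(model, X_train)
Z_test = transform_kpca(model, X_test)

knn_pred = knn_predict(Z_train, y_train, Z_test, k=3)
rocchio_pred = rocchio_predict(Z_train, y_train, Z_test)
print("KNN:", knn_pred, "Rocchio:", rocchio_pred)
```

In a real tracking system the rows would be high-dimensional term vectors from transcripts, and the kernel, number of components, and `k` would be tuned on the development set.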
Source
Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》)
EI
CAS
CSCD
PKU Core (北大核心)
2013, No. 6, pp. 865-868 (4 pages)
Keywords
topic tracking
kernel principal component analysis
topic drift
feature extraction