Abstract
The rapid growth in the volume of films and TV dramas poses a major challenge for managing them effectively, and role recognition is an important part of managing their content. Traditional role recognition mainly relies on supervised learning, whose performance depends on the quality of the training samples, yet sufficient training samples are generally hard to obtain in practice. For role recognition in films and TV dramas, this paper proposes a cross-modal unsupervised speaker recognition method. First, audio clustering based on acoustic features and temporal proximity yields an audio label sequence corresponding to the clustering result. Next, script parsing yields a text label sequence recording the speaker, the spoken content, and the speaking time. Finally, the audio sequence and the text sequence are matched across modalities: a surjection is constructed and solved for the minimum edit (Levenshtein) distance, which achieves speaker recognition. Experimental results show that, when the training set is small, the method achieves a higher recognition rate than supervised methods.
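The matching step described in the abstract can be illustrated with a minimal sketch. It assumes the audio label sequence is a list of cluster IDs and the text label sequence is a list of speaker names taken from the script. The function names `levenshtein` and `best_surjection`, and the brute-force search over all mappings, are illustrative assumptions and not the solver used in the paper; exhaustive search is only practical for a handful of clusters and speakers.

```python
from itertools import product


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute; cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def best_surjection(audio_labels, text_labels):
    """Try every mapping of clusters onto speakers that covers all speakers
    (a surjection) and keep the one with the smallest edit distance.

    Returns (distance, mapping), or None when there are fewer clusters
    than speakers, in which case no surjection exists.
    """
    clusters = sorted(set(audio_labels))
    speakers = sorted(set(text_labels))
    best = None
    for assignment in product(speakers, repeat=len(clusters)):
        if set(assignment) != set(speakers):
            continue  # not surjective: some speaker has no cluster
        mapping = dict(zip(clusters, assignment))
        mapped = [mapping[c] for c in audio_labels]
        d = levenshtein(mapped, text_labels)
        if best is None or d < best[0]:
            best = (d, mapping)
    return best


# Hypothetical toy data: three audio clusters, two scripted speakers.
audio = ["c0", "c1", "c0", "c2", "c1"]
text = ["Ann", "Bob", "Ann", "Ann", "Bob"]
result = best_surjection(audio, text)
```

In this toy case two clusters map to the same speaker, which is why the mapping is a surjection rather than a bijection: clustering may split one speaker's speech across several clusters.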
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
2016, Issue 5, pp. 132-135, 147 (5 pages)
Funding
Key Program of the National Natural Science Foundation of China (61231015)
Keywords
Speaker recognition
Speaker clustering
Levenshtein distance
Gaussian mixture model
Sequence alignment