Abstract: Emotion mismatch between training and testing is one of the main factors that degrade the performance of speaker recognition systems. In our previous work, a bi-model emotion speaker recognition (BESR) method based on virtual HD (High Different from neutral, with large pitch offset) speech synthesis was proposed to deal with this problem. It improved system performance under mismatched emotional states on MASC, but still carried the risk introduced by fusing the scores of the unreliable VHD model and the neutral model with equal weights. In this paper, we propose a new BESR method based on score reliability fusion. Two strategies, one based on the identification rate and the other on the average relative loss difference of the scores, are presented to estimate the weights for the two groups of scores. The results on both MASC and EPST show that, with the weights generated by either strategy, the BESR method performs better than with equal weights, and the better of the two even achieves results comparable to those obtained with the best weights found by exhaustive search.
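The fusion step described above can be illustrated with a minimal sketch. It assumes each model produces one score per enrolled speaker and that the two weights sum to one; the abstract does not state how either strategy computes the weight, so `w_neutral` is simply passed in. The names `fuse_scores` and `identify` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def fuse_scores(neutral: Sequence[float], vhd: Sequence[float],
                w_neutral: float) -> List[float]:
    """Weighted sum of per-speaker scores from the neutral and VHD models.

    Assumption: the weights sum to one, so the VHD weight is 1 - w_neutral.
    """
    if len(neutral) != len(vhd):
        raise ValueError("both models must score the same set of speakers")
    if not 0.0 <= w_neutral <= 1.0:
        raise ValueError("w_neutral must lie in [0, 1]")
    w_vhd = 1.0 - w_neutral
    return [w_neutral * n + w_vhd * v for n, v in zip(neutral, vhd)]


def identify(fused: Sequence[float]) -> int:
    """Return the index of the speaker with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Illustrative scores for three enrolled speakers (made-up values).
neutral_scores = [0.2, 0.5, 0.3]
vhd_scores = [0.6, 0.1, 0.3]

equal_weight_choice = identify(fuse_scores(neutral_scores, vhd_scores, 0.5))
reweighted_choice = identify(fuse_scores(neutral_scores, vhd_scores, 0.8))
```

With these made-up scores, equal weighting and a weight that favours the neutral model select different speakers, which is the kind of sensitivity that motivates estimating the weights rather than fixing them.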
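Abstract: In the video summarization task, global attention over long video sequences produces attention values with high variance, the importance scores generated for key frames are strongly biased, and segments have poor semantic coherence because the boundary values of time-series nodes lack long-range dependencies. To address these problems, the attention module is improved by combining segmented local self-attention with global self-attention, which captures key features of both local and global video sequences and reduces the variance of the attention values. A bidirectional gated recurrent unit network (BiGRU) is introduced in parallel; the outputs of the two branches are fed separately into an improved classification-regression module, and the results are then fused additively. Finally, non-maximum suppression (NMS) and kernel temporal segmentation (KTS) are used to filter segments and split them into high-quality representative shots, and a knapsack combinatorial optimization algorithm generates the final summary. The result is a video summarization model that combines a multi-scale attention mechanism with a bidirectional gated recurrent network (local and global attentions combine with the BiGRU, LG-RU). Comparative experiments on the standard and augmented TvSum and SumMe datasets show that the model achieves higher F-scores, confirming that it summarizes videos robustly while maintaining high accuracy.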
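The final knapsack step can be sketched as a standard 0/1 knapsack over shots. This is a generic dynamic-programming formulation, not the authors' implementation: the function name `select_shots`, the use of integer shot lengths in frames, and the budget value are all assumptions made for illustration.

```python
from typing import List, Sequence


def select_shots(lengths: Sequence[int], scores: Sequence[float],
                 budget: int) -> List[int]:
    """Pick shots maximizing total score with total length <= budget.

    Assumption: lengths are non-negative integers (e.g. frame counts).
    Returns the indices of the chosen shots in ascending order.
    """
    n = len(lengths)
    # best[i][c] = best total score using the first i shots within capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = lengths[i - 1], scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # skip shot i-1
            if length <= c:
                candidate = best[i - 1][c - length] + score  # take shot i-1
                if candidate > best[i][c]:
                    best[i][c] = candidate
    # Walk back through the table to recover which shots were taken.
    chosen: List[int] = []
    c = budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    chosen.reverse()
    return chosen


# Illustrative shot lengths (frames) and importance scores (made-up values).
summary = select_shots([4, 3, 2, 5], [3.0, 2.5, 2.0, 4.0], budget=7)
```

In this toy case the last two shots fit the budget exactly and have the highest combined score, so they form the summary.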