Abstract
The classification of Chinese film and television scripts currently relies mainly on manual experience, which is costly and inefficient. No existing research addresses automatic topic classification of Chinese film and television scripts, so this paper studies topic extraction. Traditional topic generation models rely on the similarity between documents and paragraphs, paragraphs and sentences, and sentences and words, while ignoring the similarity between sentences within the text. First, the ISOMAP method is used to reduce the dimensionality of the sample set's vector space. Second, an algorithm model that combines cross entropy with perplexity is proposed to determine the optimal number of topics for LDA to extract. Finally, following a script-topic approach, the LDA algorithm is used to mine the latent topic words of each script, and SVM is used to further classify the topic words.
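The abstract does not say how cross entropy and perplexity are combined, so the following is only a minimal sketch of the standard relationship between the two quantities (perplexity is the exponential of the per-token cross entropy) and of one simple way to compare candidate topic counts. The function names, the candidate values of K, and the probabilities are all hypothetical and do not come from the paper. In practice the per-token probabilities would come from LDA models fitted with different topic counts and evaluated on held-out text.

```python
import math


def cross_entropy(token_probs):
    """Average negative log-likelihood per token, in nats."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the per-token cross entropy."""
    return math.exp(cross_entropy(token_probs))


def pick_topic_count(held_out_probs_by_k):
    """Return the candidate K whose held-out perplexity is lowest.

    Assumption: lowest perplexity is used as the selection rule here;
    the paper's combined criterion may differ.
    """
    return min(
        held_out_probs_by_k,
        key=lambda k: perplexity(held_out_probs_by_k[k]),
    )


# Hypothetical held-out token probabilities for three candidate topic counts.
candidates = {
    5: [0.02, 0.01, 0.03, 0.02],
    10: [0.05, 0.04, 0.06, 0.05],
    20: [0.03, 0.02, 0.04, 0.03],
}
best_k = pick_topic_count(candidates)
```

Because the exponential is monotonic, ranking candidates by perplexity alone gives the same order as ranking by cross entropy alone. Any benefit from combining them would have to come from how the paper defines the combined criterion, which the abstract does not state.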
Authors
薛佳奇
杨凡
XUE Jiaqi; YANG Fan (School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an 710055, China; School of Science, Xi'an University of Architecture and Technology, Xi'an 710055, China)
Source
《智能计算机与应用》
2019, No. 4, pp. 45-50 (6 pages)
Intelligent Computer and Applications
Keywords
Chinese film and television script
ISOMAP dimension reduction
LDA
cross entropy
perplexity
SVM