Abstract
Domain question classification plays an important role in question answering (Q&A) systems. Most existing work on question classification targets open domains, and few studies address specific domains. Domain questions are typically short and suffer from data sparseness. To address this, the paper proposes a domain question classification method based on topic expansion. The method consists of two components: feature selection and feature expansion. It first uses the chi-square statistic (CHI) to select feature words from the question text; these words serve as the basis for feature expansion. It then applies a Latent Dirichlet Allocation (LDA) topic model to an external knowledge base to obtain the corresponding topic distribution. To avoid introducing noisy topics, it uses topic entropy to identify high-quality topics. Finally, the words covered by the high-quality topics are added to the question text, and a Support Vector Machine (SVM) classifier classifies the expanded text. Experimental results show that the proposed method classifies better than the traditional TFIDF text classification method and can improve the performance of Q&A systems.
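The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the data layout (`topic_words`, `topic_dists`), the `max_entropy` threshold, and the rule that a topic is used only when it contains one of the question's selected feature words are all assumptions made for illustration. The abstract does not state which distribution the topic entropy is computed over, so the sketch accepts any per-topic probability list. The LDA training and the SVM classification steps are omitted.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score over all classes."""
    vocab = {w for d in docs for w in d}
    scores = {}
    for term in vocab:
        best = 0.0
        for c in set(labels):
            pairs = list(zip(docs, labels))
            n11 = sum(1 for d, y in pairs if y == c and term in d)
            n10 = sum(1 for d, y in pairs if y != c and term in d)
            n01 = sum(1 for d, y in pairs if y == c and term not in d)
            n00 = len(docs) - n11 - n10 - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best
    # Sort by descending score, breaking ties alphabetically.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


def entropy(probs):
    """Shannon entropy (natural log) of a probability list."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expand(tokens, features, topic_words, topic_dists, max_entropy, top_n):
    """Append words from low-entropy topics that share a feature word.

    topic_words: topic id -> words sorted by descending probability
                 (hypothetical output of a trained LDA model)
    topic_dists: topic id -> probability list used to score the topic
                 (assumed; the abstract does not specify it)
    """
    expanded = list(tokens)
    for topic, dist in topic_dists.items():
        if entropy(dist) > max_entropy:
            continue  # treat high-entropy topics as noisy
        top = topic_words[topic][:top_n]
        if any(w in features and w in top for w in tokens):
            for w in top:
                if w not in expanded:
                    expanded.append(w)
    return expanded
```

In this sketch a short question such as `["visa", "fee"]` gains the top words of any retained topic that contains a selected feature word, which is one simple way to reduce sparseness before the classifier sees the text.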
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core (北大核心)
2016, No. 9, pp. 202-207, 213 (7 pages in total)
Funding
Scientific Research Program of the Science and Technology Commission of Shanghai Municipality (1451110700, 14511106803)
Special Development Fund of the Shanghai Zhangjiang National Innovation Demonstration Zone (201411-JA-B108-002)
Keywords
domain question classification
data sparseness
feature selection
topic model
high quality topic
feature expansion