Abstract
To describe text data more efficiently, text vectors are represented as points on a statistical manifold, and kernel methods are used to combine generative and discriminative models of text. A diffusion kernel on the Dirichlet compound multinomial (DCM) statistical manifold serves as the distance metric on the text space, and a kernel nearest neighbor algorithm on the DCM manifold is proposed for text classification. Experiments on two text corpora show that the precision and recall of the DCM-manifold kernel nearest neighbor classifier are better than or comparable to those of the comparison algorithms.
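The idea of classifying by nearest neighbors under a kernel-induced distance can be illustrated with a minimal sketch. The abstract does not give the closed form of the DCM diffusion kernel, so the code below substitutes the well-known diffusion kernel approximation on the multinomial manifold, K_t(p, q) ∝ exp(-arccos²(Σ√(p_i q_i)) / t), as a stand-in. The function names, the diffusion time `t`, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def normalize(counts):
    """Map raw term counts to a point on the probability simplex."""
    total = sum(counts)
    return [c / total for c in counts]


def multinomial_diffusion_kernel(p, q, t=0.5):
    """Stand-in kernel (assumption): diffusion kernel approximation on the
    multinomial manifold, with the constant factor dropped because it is
    the same for every pair and does not change neighbor rankings."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(1.0, max(0.0, bc))  # guard acos against rounding error
    return math.exp(-(math.acos(bc) ** 2) / t)


def kernel_distance_sq(x, y, kernel):
    """Squared distance in the feature space induced by the kernel:
    d^2(x, y) = K(x, x) - 2 K(x, y) + K(y, y)."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def kernel_knn_predict(query, train_points, train_labels, k=3,
                       kernel=multinomial_diffusion_kernel):
    """Majority vote among the k training points nearest in kernel distance."""
    ranked = sorted(
        range(len(train_points)),
        key=lambda i: kernel_distance_sq(query, train_points[i], kernel),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical three-term vocabulary and labels, for illustration only.
    train = [normalize(c) for c in ([5, 1, 0], [4, 2, 0], [0, 1, 5], [0, 2, 4])]
    labels = ["sports", "sports", "finance", "finance"]
    print(kernel_knn_predict(normalize([3, 1, 0]), train, labels, k=3))
```

Swapping in a kernel derived from the DCM manifold would only require replacing `multinomial_diffusion_kernel`; the distance and voting steps do not depend on which kernel is used.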
Source
Transactions of Beijing Institute of Technology (《北京理工大学学报》)
EI
CAS
CSCD
PKU Core (北大核心)
2010, Issue 3, pp. 315-319 (5 pages)
Funding
National Ministries and Commissions Pre-research Project (504-4)
Keywords
diffusion kernel
kernel nearest neighbor
Dirichlet compound multinomial
text classification