Abstract
The lack of semantic information is a critical limitation of the context tree kernel for text representation. This paper proposes a context tree kernel construction method based on latent topics. First, the words of a document are mapped to a latent topic space through Latent Dirichlet Allocation (LDA); then context tree models are built over the latent topics; finally, the context tree kernel is constructed from the mutual information between the models. By defining the document generative model over semantic classes of words rather than the words themselves, the method alleviates the data-sparsity problem of word-based text modeling. Clustering experiments on text data sets show that the proposed context tree kernel better measures the topic similarity between documents and improves text clustering performance.
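The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: a tiny collapsed Gibbs sampler stands in for a full LDA implementation, the corpus is a hypothetical toy example, and a cosine kernel over topic-bigram counts is substituted for the paper's mutual-information kernel between context tree models.

```python
import math
import random
from collections import Counter

random.seed(0)

# Toy corpus (hypothetical data): two "sports" documents, two "finance" ones.
docs = [
    "ball game team score win team ball".split(),
    "team win score game ball score".split(),
    "stock market price trade fund market".split(),
    "price fund trade stock market trade".split(),
]

K = 2                     # number of latent topics
alpha, beta = 0.5, 0.1    # symmetric Dirichlet hyperparameters
vocab = sorted({w for d in docs for w in d})
V = len(vocab)
widx = [[vocab.index(w) for w in d] for d in docs]

# Step 1: collapsed Gibbs sampling for LDA. z[d][i] is the topic assigned
# to word i of document d.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]       # document-topic counts
nkw = [[0] * V for _ in range(K)]   # topic-word counts
nk = [0] * K                        # topic totals
for d, doc in enumerate(widx):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

for _ in range(200):
    for d, doc in enumerate(widx):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            probs = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                     for k in range(K)]
            r = random.random() * sum(probs)
            for k in range(K):
                r -= probs[k]
                if r <= 0:
                    t = k
                    break
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Step 2: replace each word by its inferred topic label, giving one topic
# sequence per document (the unit over which context tree models are built).
topic_seqs = [list(zd) for zd in z]

# Step 3 (stand-in): cosine similarity over topic-bigram counts. The paper
# instead fits a context tree model per sequence and derives the kernel from
# the mutual information between models.
def kernel(s, t):
    cs, ct = Counter(zip(s, s[1:])), Counter(zip(t, t[1:]))
    dot = sum(cs[g] * ct[g] for g in cs)
    ns = math.sqrt(sum(v * v for v in cs.values()))
    nt = math.sqrt(sum(v * v for v in ct.values()))
    return dot / (ns * nt) if ns and nt else 0.0

# Kernel (Gram) matrix over the four documents; any kernel-based clustering
# method (e.g. kernel k-means) could consume this matrix.
K_mat = [[kernel(a, b) for b in topic_seqs] for a in topic_seqs]
```

Because words are first collapsed into a small number of topic labels, the count statistics feeding the sequence models are far denser than word-level counts, which is the sparsity argument the abstract makes.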
Source
《电子与信息学报》
EI
CSCD
PKU Core (北大核心)
2010, No. 11, pp. 2695-2700 (6 pages)
Journal of Electronics & Information Technology