Core term based mean partition similarity for short text clustering (基于核心词项平均划分相似度的短文本聚类算法)
Abstract: Short texts have extremely sparse features and depend strongly on context. To address these characteristics, we propose CTMPS, a top-down short text clustering algorithm based on the mean partition similarity of core terms. CTMPS first computes the probabilistic correlation between terms over the entire short text corpus and uses it to weight the terms in each short text. The terms with the largest weights are taken as the core terms that best represent the short text, and together they form the core term set. Then, on the basis of information theory, the mean partition similarity (MPS) is calculated with each core term as the partitioning criterion. The short texts containing the core term with the largest MPS form one cluster, and this strategy is iterated until the requirements are met. Experimental results show that CTMPS significantly improves short text clustering and outperforms the baseline algorithm in terms of performance and clustering efficiency.
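The abstract names the stages of CTMPS but gives no formulas, so the following Python sketch only illustrates the overall top-down flow under stated assumptions. Everything specific in it is hypothetical: the probabilistic correlation is approximated by a document co-occurrence conditional probability, the term weight by a sum of those probabilities, and the information-theoretic MPS is replaced by a simple placeholder (mean pairwise Jaccard similarity of core term sets). The function names and the parameter `k` are likewise illustrative and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_correlation(docs):
    """Assumed stand-in for probabilistic correlation:
    corr[(a, b)] = P(b | a) = df(a, b) / df(a), from document co-occurrence."""
    df, pair_df = Counter(), Counter()
    for doc in docs:
        terms = set(doc)
        df.update(terms)
        for a, b in combinations(sorted(terms), 2):
            pair_df[(a, b)] += 1
            pair_df[(b, a)] += 1
    return {(a, b): n / df[a] for (a, b), n in pair_df.items()}


def core_terms(doc, corr, k=2):
    """Weight each term by how strongly the other terms of the same text
    predict it (sum of P(t | u)), then keep the k heaviest terms."""
    terms = set(doc)
    weight = {t: sum(corr.get((u, t), 0.0) for u in terms if u != t)
              for t in terms}
    return sorted(terms, key=lambda t: (-weight[t], t))[:k]


def group_score(members, cores):
    """Placeholder for MPS (NOT the paper's definition): mean pairwise
    Jaccard similarity of the members' core term sets."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(len(cores[i] & cores[j]) / len(cores[i] | cores[j])
               for i, j in pairs) / len(pairs)


def cluster(docs, k=2):
    """Top-down loop: repeatedly split off the texts that share the
    best-scoring core term until no texts remain."""
    corr = cooccurrence_correlation(docs)
    cores = [set(core_terms(d, corr, k)) for d in docs]
    remaining = set(range(len(docs)))
    clusters = []
    while remaining:
        candidates = sorted({t for i in remaining for t in cores[i]})
        if not candidates:  # only texts without any terms are left
            clusters.append(sorted(remaining))
            break

        def members_of(term):
            return sorted(i for i in remaining if term in cores[i])

        best = max(candidates, key=lambda t: group_score(members_of(t), cores))
        group = members_of(best)
        clusters.append(group)
        remaining -= set(group)
    return clusters


if __name__ == "__main__":
    sample = [
        ["apple", "iphone", "price"],
        ["iphone", "apple", "release"],
        ["football", "match", "goal"],
        ["goal", "football", "league"],
    ]
    print(cluster(sample))
```

Each iteration removes at least one text, so the loop always terminates. A faithful implementation would replace `group_score` with the paper's information-theoretic MPS and use its actual stopping requirement, neither of which is specified in the abstract.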
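Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, Peking University Core Journal (北大核心), 2017, No. 8, pp. 1562-1569 (8 pages)
Funding: National Natural Science Foundation of China (61363058); Gansu Province Youth Science and Technology Fund (145RJYA259); Gansu Province Natural Science Research Fund (145RJZA232, 150RJZA127); Open Fund of the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (IIP2014-4); 2016 Undergraduate Innovation Ability Enhancement Program, Academic Science and Technology Innovation Team project; 2016 Gansu Province College Student Innovation and Entrepreneurship Training Program (201610736040, 201610736041)
Keywords: short text clustering; core term; mean partition similarity; probabilistic correlation; entropy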