Combining Coupled Distance Discrimination and Strong Classification Features for Short Text Similarity Calculation (融合耦合距离区分度和强类别特征的短文本相似度计算方法)

Cited by: 12
Abstract: Short text similarity calculation plays a vital role in social networks, text mining, natural language processing, and related fields. Short texts are brief, and their features are severely sparse and high-dimensional, while traditional short text similarity calculation ignores category information. To address these problems, a short text similarity calculation method that combines coupled distance discrimination and strong classification features, CDDCF, is presented. On the one hand, the distance between two co-occurring terms is used, across the whole short text corpus, to compute a co-occurrence distance correlation. Terms are weighted by this correlation so that the intra- and inter-relations between terms are captured, which yields the coupled distance discrimination similarity of short texts. On the other hand, the feature terms with strong class discrimination ability in each category are extracted from a small amount of labeled data to form the strong classification feature set, and the context of each term is used to disambiguate these features semantically. The similarity between two short texts is then measured by the number of strong classification features of the same category that both texts contain. Finally, coupled distance discrimination and strong classification features are unified into a joint framework to measure the similarity of short texts. Experimental results show that CDDCF outperforms baseline algorithms in terms of the accuracy and efficiency of similarity computation.
Authors: MA Hui-fang (马慧芳); LIU Wen (刘文); LI Zhi-xin (李志欣); LIN Xiang-hong (蔺想红) (College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu 730000, China; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China; Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi 541004, China)
Source: Acta Electronica Sinica (《电子学报》), 2019, No. 6, pp. 1331-1336 (6 pages). Indexed in: EI, CAS, CSCD, Peking University Core Journals.
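Funding: National Natural Science Foundation of China (No. 61762078, No. 61363058, No. 61663004); Open Fund of the Guangxi Key Lab of Multi-source Information Mining and Security (No. MIMS18-08); Research Project of the Guangxi Key Laboratory of Trusted Software (No. KX201705)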
Keywords: text mining; natural language processing; text clustering; social network; coupling relation; feature extraction; word sense disambiguation; similarity computation
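The abstract names three ingredients (distance-based co-occurrence weighting, shared strong classification features, and a final combination) but gives no formulas. The Python sketch below is only an illustration of how such a pipeline could be wired together, not the CDDCF method itself. Every concrete choice is an assumption: the `1 / distance` co-occurrence score, the cosine comparison of weighted term vectors, the min-count overlap per category, and the mixing parameter `alpha`. All function names are hypothetical, and the context-based disambiguation step is omitted.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_distance_correlation(corpus):
    """Sum 1/distance over every co-occurrence of two distinct terms in the corpus.

    Assumption: the abstract only says that the distance between co-occurring
    terms is used; the 1/distance form is chosen here for illustration.
    """
    score = defaultdict(float)
    for tokens in corpus:
        for i, a in enumerate(tokens):
            for j in range(i + 1, len(tokens)):
                b = tokens[j]
                if a != b:
                    w = 1.0 / (j - i)
                    score[(a, b)] += w
                    score[(b, a)] += w
    return score


def term_weights(tokens, score):
    """Weight each term by 1 plus its correlation with the other terms of the text."""
    uniq = set(tokens)
    return {t: 1.0 + sum(score.get((t, u), 0.0) for u in uniq if u != t)
            for t in uniq}


def cosine(w1, w2):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    n1 = sqrt(sum(v * v for v in w1.values()))
    n2 = sqrt(sum(v * v for v in w2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def strong_feature_similarity(t1, t2, strong_features):
    """Overlap of strong features per category (assumed normalisation).

    strong_features maps a term to its category label; in the paper this set
    is extracted from labeled data, here it is simply given.
    """
    c1 = Counter(strong_features[t] for t in t1 if t in strong_features)
    c2 = Counter(strong_features[t] for t in t2 if t in strong_features)
    total = max(sum(c1.values()), sum(c2.values()))
    if total == 0:
        return 0.0
    shared = sum(min(c1[c], c2[c]) for c in c1.keys() & c2.keys())
    return shared / total


def combined_similarity(t1, t2, score, strong_features, alpha=0.5):
    """Linear mix of the two similarities; alpha is a hypothetical parameter."""
    cdd = cosine(term_weights(t1, score), term_weights(t2, score))
    scf = strong_feature_similarity(t1, t2, strong_features)
    return alpha * cdd + (1.0 - alpha) * scf


# Toy corpus and hand-made strong feature set, for illustration only.
corpus = [
    ["apple", "phone", "release"],
    ["apple", "fruit", "juice"],
    ["phone", "battery", "release"],
]
strong = {"phone": "tech", "battery": "tech", "fruit": "food", "juice": "food"}
score = cooccurrence_distance_correlation(corpus)
sim_tech = combined_similarity(corpus[0], corpus[2], score, strong)
sim_mixed = combined_similarity(corpus[1], corpus[2], score, strong)
```

In this toy setup the first and third texts share two terms and one "tech" feature, so `sim_tech` comes out higher than `sim_mixed`, where the texts share no terms and belong to different categories.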