Research on a text similarity calculation method based on semantic combination of terms (cited by: 4)

Research on text similarity calculation strategy based on semantic combination of keywords


Abstract: Similarity comparison between texts relies mainly on keyword matching and lacks an in-depth analysis of the combination relationships among keywords. With respect to these combination properties, the more keywords that appear combined in order, the greater their contribution to the similarity between texts. On this basis, a non-linear semantic relevance function of the number of combined keywords is proposed, and all keyword combination blocks in the texts are extracted on the basis of LCS. The method is applied to short-text similarity comparison using the Microsoft Research Paraphrase corpus (MSRP). The experimental results show that both the accuracy and the F1 value of short-text similarity calculation improve over the traditional algorithm, with the improvement in F1 being the more pronounced.
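The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm. The abstract does not give the form of the non-linear function or the details of the LCS-based extraction, so two assumptions are made here. First, Python's `difflib.SequenceMatcher` matching blocks stand in for the LCS-based extraction of keyword combination blocks. Second, a power function `size ** alpha` with `alpha > 1` stands in for the non-linear semantic relevance function. The names `keyword_blocks`, `block_similarity`, and `alpha` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def keyword_blocks(a: List[str], b: List[str]) -> List[List[str]]:
    """Return runs of keywords that occur contiguously, in order, in both lists.

    Assumption: difflib's matching blocks are used as a stand-in for the
    LCS-based block extraction mentioned in the abstract.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size > 0  # the final block returned by difflib is a size-0 sentinel
    ]


def block_similarity(a: List[str], b: List[str], alpha: float = 2.0) -> float:
    """Score two keyword lists so that longer in-order blocks count more.

    Assumption: each block of n keywords contributes n ** alpha (alpha > 1),
    a hypothetical non-linear weighting; the paper's function may differ.
    The score is normalised to [0, 1] by the longer list's length.
    """
    if not a or not b:
        return 0.0
    weight = sum(len(block) ** alpha for block in keyword_blocks(a, b))
    return weight / max(len(a), len(b)) ** alpha


if __name__ == "__main__":
    s1 = ["text", "similarity", "based", "on", "keywords"]
    s2 = ["text", "similarity", "uses", "keywords"]
    print(keyword_blocks(s1, s2))    # [['text', 'similarity'], ['keywords']]
    print(block_similarity(s1, s2))  # 0.2
```

With `alpha = 2`, the two-keyword block contributes 4 while a single matched keyword contributes 1, so an ordered combination outweighs the same number of isolated matches, which is the property the abstract describes.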

Authors: ZHOU Lijie; YU Weihai; GUO Cheng (Electronic Teaching Center, Yantai Vocational College, Yantai, Shandong 264670, China; Yantai Normal Language Teaching Center, Yantai, Shandong 264670, China; School of Software Technology, Dalian University of Technology, Dalian, Liaoning 116620, China)

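Affiliations: Electronic Teaching Center, Yantai Vocational College; Yantai Normal Language Teaching Center; School of Software Technology, Dalian University of Technology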

Source: Computer Engineering and Applications (CSCD; Peking University Core Journal), 2016, No. 19, pp. 90-93 (4 pages)

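Funding: National Natural Science Foundation of China (No. 61401060, No. 61272173); Science and Technology Program of Shandong Higher Education Institutions (No. J12LN73)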

Keywords: combination of keywords; non-linear semantic relevance; semantic relevance function; text similarity

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]



