Text Similarity Measurement Based on n-gram Graph of Named Entity

Abstract: Text comparison is widely used in natural language processing (NLP). This paper proposes a new text similarity measurement method that represents documents using named-entity information extracted from the text together with n-gram graphs. OpenCalais serves as the named entity recognition service, and the JInsect toolkit is used to construct and manage the n-grams. Text similarity is measured with the k-means text clustering algorithm, and the resulting clusters are evaluated with several cluster validity indexes.
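The n-gram graph representation named in the abstract can be illustrated with a minimal sketch. This is not the JInsect API and not necessarily the authors' exact procedure: it is plain Python, the entity lists are hypothetical stand-ins for the output of a named entity recognition service, and the parameters `n` and `window`, the function names, and the choice to join entities into a single lowercase string are all assumptions made for illustration. The similarity function follows the value-similarity idea commonly described for n-gram graphs (shared edges, weighted by how closely their weights agree).

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build an undirected character n-gram graph as a Counter of weighted edges."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Link each n-gram to the next `window` n-grams that follow it.
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


def value_similarity(g1, g2):
    """Sum of min/max weight ratios over shared edges, divided by the larger graph's size."""
    if not g1 or not g2:
        return 0.0
    shared = (e for e in g1 if e in g2)
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in shared)
    return total / max(len(g1), len(g2))


# Hypothetical entity lists, standing in for the output of an NER service.
doc_a = ["University of Sanya", "Hainan", "OpenCalais"]
doc_b = ["University of Sanya", "Hainan", "JInsect"]
doc_c = ["Prentice-Hall", "Minneapolis"]

# Assumption: each document is represented by its entities joined into one string.
graph_a, graph_b, graph_c = (
    ngram_graph(" ".join(doc).lower()) for doc in (doc_a, doc_b, doc_c)
)

print(value_similarity(graph_a, graph_b))
print(value_similarity(graph_a, graph_c))
```

Pairwise similarities of this kind could then be passed to a clustering step such as k-means; how the paper turns graph similarities into clustering input is not stated in the abstract, so that step is left out here.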

Authors: Yu Ying; Zhou Xianchun; Jia Shuwen (Information and Intelligent Engineering College, University of Sanya, Sanya 572000; Rong Chunming Academician Workstation, University of Sanya, Sanya 572000; Saxo Financial Technology Business College, University of Sanya, Sanya 572000)

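Affiliations: Information and Intelligent Engineering College, University of Sanya; Rong Chunming Academician Workstation, University of Sanya; Saxo Financial Technology Business College, University of Sanya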

Source: Modern Computer (《现代计算机》), 2022, No. 2, pp. 73-77 (5 pages)

Funding: Hainan Provincial Natural Science Foundation, Youth Project (621QN270).

Keywords: natural language processing (NLP); n-gram graph; text clustering; text similarity measurement

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
