Text Similarity Measurement Based on n-gram Graph of Named Entity
Abstract: Text comparison is widely used in natural language processing (NLP). This paper proposes a new text similarity measure that represents documents using named entity information extracted from the text together with n-gram graphs. OpenCalais serves as the named entity recognition service, and the JInsect toolkit is used to construct and manage the n-grams. The k-means text clustering algorithm is applied to measure text similarity, and the resulting clusters are evaluated with several cluster validity indices.
Authors: Yu Ying; Zhou Xianchun; Jia Shuwen (Information and Intelligent Engineering College, University of Sanya, Sanya 572000; Rong Chunming Academician Workstation, University of Sanya, Sanya 572000; Saxo Financial Technology Business College, University of Sanya, Sanya 572000)
Source: Modern Computer (《现代计算机》), 2022, No. 2, pp. 73-77 (5 pages)
Funding: Hainan Provincial Natural Science Foundation Youth Project (621QN270).
Keywords: natural language processing (NLP); n-gram graph; text clustering; text similarity measurement
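
The abstract names the building blocks (named entities, n-gram graphs, a similarity measure) without giving details, so the following is only a minimal sketch of how an n-gram graph comparison over named entities might look. It is written in plain Python and does not use or reproduce the JInsect API. The function names, the parameters `n` and `window`, the simplified value-similarity formula, and the hand-written entity lists (standing in for OpenCalais output) are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph, stored as a Counter of weighted edges.

    Nodes are character n-grams. An undirected edge links two n-grams whose
    start positions are at most `window` apart; its weight is the number of
    times that pair co-occurs. (Illustrative scheme, not JInsect's exact one.)
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        last = min(i + window, len(grams) - 1)
        for j in range(i + 1, last + 1):
            edge = tuple(sorted((gram, grams[j])))
            edges[edge] += 1
    return edges


def entity_graph(entities, n=3, window=3):
    """Merge the n-gram graphs of all named entities found in one document."""
    graph = Counter()
    for entity in entities:
        graph.update(ngram_graph(entity.lower(), n, window))
    return graph


def value_similarity(g1, g2):
    """Simplified value similarity between two edge-weighted graphs.

    Each shared edge contributes min(w1, w2) / max(w1, w2); the sum is
    normalised by the edge count of the larger graph, giving a score in [0, 1].
    """
    if not g1 or not g2:
        return 0.0
    shared = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in shared)
    return total / max(len(g1), len(g2))


# Hypothetical entity lists; a real pipeline would obtain these from an
# NER service such as OpenCalais.
doc_a = entity_graph(["University of Sanya", "Hainan"])
doc_b = entity_graph(["University of Sanya", "Sanya"])
doc_c = entity_graph(["OpenCalais", "JInsect"])

print(value_similarity(doc_a, doc_b))
print(value_similarity(doc_a, doc_c))
```

Pairwise scores of this kind could then be passed to a clustering step. The record states that k-means is used but does not say how the graph-based scores are fed into it, so that part is left out here.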