Abstract
Text comparison is widely used in natural language processing (NLP). This paper proposes a new text similarity measure that represents each document using named entity information extracted from the text together with n-gram graphs. OpenCalais serves as the named entity recognition service, the JInsect toolkit is used to construct and manage the n-grams, and the k-means text clustering algorithm is used to measure text similarity. The resulting clusters are evaluated with a range of cluster validity indices.
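As a rough illustration of the n-gram graph idea mentioned in the abstract, the sketch below builds character n-gram graphs for short strings and compares them by edge overlap. It is not the JInsect API (JInsect is a Java toolkit) and not necessarily the measure used in the paper: the function names `ngram_graph` and `containment_similarity`, the parameters `n` and `window`, and the sample strings are all assumptions made for this example.

```python
from collections import Counter

def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of undirected edges.

    Hypothetical helper: nodes are character n-grams, and an edge links
    two n-grams that start within `window` positions of each other.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        # Link each n-gram to the n-grams that follow it inside the window.
        for h in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((g, h)))] += 1
    return edges

def containment_similarity(g1, g2):
    """Share of the smaller graph's edges that also occur in the other graph."""
    if not g1 or not g2:
        return 0.0
    shared = len(g1.keys() & g2.keys())
    return shared / min(len(g1), len(g2))

# Assumed sample documents, chosen only to show the behaviour.
doc_a = "OpenCalais extracts named entities from text"
doc_b = "OpenCalais extracts entities from news text"
doc_c = "k-means groups points around centroids"

sim_ab = containment_similarity(ngram_graph(doc_a), ngram_graph(doc_b))
sim_ac = containment_similarity(ngram_graph(doc_a), ngram_graph(doc_c))
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

In a pipeline like the one the abstract describes, pairwise scores of this kind (possibly combined with named entity overlap) could feed a clustering step such as k-means; that combination is left out here because the abstract does not specify it.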
Authors
Yu Ying (于营); Zhou Xianchun (周显春); Jia Shuwen (贾树文)
Affiliations: Information and Intelligent Engineering College, University of Sanya, Sanya 572000; Rong Chunming Academician Workstation, University of Sanya, Sanya 572000; Saxo Financial Technology Business College, University of Sanya, Sanya 572000
Source
Modern Computer (《现代计算机》)
2022, No. 2, pp. 73-77 (5 pages)
Funding
Youth Program of the Hainan Provincial Natural Science Foundation (621QN270).