Abstract
Existing text similarity measures mainly rely on TF-IDF, which models a text as a term-frequency vector and ignores its structural features. This paper combines the structural features of a text with TF-IDF and proposes a similarity measure for science and technology project texts. The method first preprocesses the text, then extracts module texts according to the document's structure, applies TF-IDF to each module text to select its top-N keywords as that module's feature vector, and finally computes the similarity of two texts with the cosine formula. Experiments on a dataset of science and technology project documents from the electric power industry show that the proposed method outperforms plain TF-IDF on the F-measure.
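The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the module names (`background`, `goal`), whitespace tokenization in place of real preprocessing, a smoothed IDF formula, and an unweighted mean over shared modules as the way per-module scores are combined (the abstract does not say how modules are aggregated). All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Return one {term: weight} dict per token list (assumed smoothed IDF)."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty modules
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def top_n_keywords(vector, n):
    """Keep the n highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def structured_similarity(corpus, i, j, n=10):
    """Compare documents i and j module by module.

    corpus is a list of {module_name: token_list} dicts. Averaging the
    per-module cosine scores is an assumption, not taken from the paper.
    """
    shared = sorted(set(corpus[i]) & set(corpus[j]))
    if not shared:
        return 0.0
    scores = []
    for module in shared:
        # TF-IDF is fitted separately on each module across the whole corpus.
        vectors = tfidf_vectors([doc.get(module, []) for doc in corpus])
        scores.append(cosine(top_n_keywords(vectors[i], n),
                             top_n_keywords(vectors[j], n)))
    return sum(scores) / len(scores)


# Toy corpus with hypothetical module names and contents.
corpus = [
    {"background": "grid fault detection sensor".split(),
     "goal": "reduce outage time".split()},
    {"background": "grid fault detection sensor".split(),
     "goal": "reduce outage time".split()},
    {"background": "solar panel storage".split(),
     "goal": "cut carbon emissions".split()},
]
same = structured_similarity(corpus, 0, 1)
different = structured_similarity(corpus, 0, 2)
```

In this toy corpus the first two documents are identical in every module, so `same` comes out at essentially 1, while the third document shares no terms with the first, so `different` is 0. Fitting TF-IDF per module rather than on the whole document is what lets a keyword carry a different weight in, say, the background section than in the goals section.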
Authors
Zhao Xiaoping (赵晓平)
Ma Wen (马文)
Liu Xueping (刘雪萍)
Chen Da (陈达)
Affiliations: Information Center, Yunnan Power Grid Co., Ltd., Kunming 650011, China; Yunnan Yundian Tongfang Technology Co., Ltd., Kunming 650220, China
Source
Application of Electronic Technique (《电子技术应用》)
2020, No. 5, pp. 31-34, 39 (5 pages)
Funding
National Natural Science Foundation of China (Grant No. 61702442).