Abstract
Text topic-word extraction and similarity calculation are widely used in many areas of natural language processing, such as search engines and intelligent question answering. Researchers in China and abroad have studied both problems extensively, but most approaches rely on complex mathematical models that are cumbersome to implement. To address this, topic words are extracted by taking the high-frequency words of a document after excluding common stop words, then comparing the cosine of the angle between the word vectors of those high-frequency words and discarding the noisy high-frequency words that differ most from the others. The words that remain are the document's final topic words. The similarity between two documents is then judged by the cosine of the angle between the sums of the word vectors of their respective topic words. Based on this idea, a text topic extraction and similarity calculation software system was developed with WinForm. It is relatively simple to implement and performs better than judging document similarity from document vectors alone.
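The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' WinForm implementation. The function names, the two-dimensional toy word vectors, the stop-word list, and the rule of scoring each word by its mean cosine similarity to the other high-frequency words are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def high_frequency_words(tokens, stop_words, top_n):
    """Return the top_n most frequent tokens that are not stop words."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(top_n)]


def drop_noise_words(words, vectors, n_drop=1):
    """Remove the n_drop words least similar, on average, to the other words.

    Assumption: "differs most from the others" is read as the lowest mean
    cosine similarity to the remaining high-frequency words.
    """
    candidates = [w for w in words if w in vectors]
    if len(candidates) <= n_drop + 1:
        return candidates  # too few words to compare meaningfully

    def mean_similarity(word):
        others = [o for o in candidates if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)

    noise = set(sorted(candidates, key=mean_similarity)[:n_drop])
    return [w for w in candidates if w not in noise]


def topic_vector(topic_words, vectors):
    """Element-wise sum of the word vectors of the topic words."""
    if not topic_words:
        return []
    total = [0.0] * len(vectors[topic_words[0]])
    for word in topic_words:
        total = [a + b for a, b in zip(total, vectors[word])]
    return total


def document_similarity(tokens_a, tokens_b, vectors, stop_words, top_n=4, n_drop=1):
    """Cosine between the summed topic-word vectors of two documents."""
    sums = []
    for tokens in (tokens_a, tokens_b):
        frequent = high_frequency_words(tokens, stop_words, top_n)
        topics = drop_noise_words(frequent, vectors, n_drop)
        sums.append(topic_vector(topics, vectors))
    return cosine(sums[0], sums[1])


# Hypothetical toy data: real use would load pretrained word vectors.
STOP_WORDS = {"the", "and", "are"}
VECTORS = {
    "cat": [1.0, 0.1], "dog": [0.9, 0.2], "pet": [0.8, 0.3],
    "stock": [0.1, 1.0], "bank": [0.2, 0.9], "market": [0.3, 0.8],
}
pets_a = "the cat and the dog are pet pet cat stock".split()
pets_b = "dog dog pet cat cat market".split()
finance = "stock stock bank bank market cat".split()

sim_same_topic = document_similarity(pets_a, pets_b, VECTORS, STOP_WORDS)
sim_diff_topic = document_similarity(pets_a, finance, VECTORS, STOP_WORDS)
print(sim_same_topic > sim_diff_topic)
```

In this toy setup the off-topic word in each document ("stock" or "market" in the pet texts, "cat" in the finance text) is frequent enough to enter the candidate list but is removed in the noise-filtering step, so the summed topic vectors reflect only the dominant theme.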
Source
Modern Information Technology (《现代信息科技》)
2017, No. 4, pp. 20-22 (3 pages)
Funding
Leshan Normal University, university-level Youth Project (Z1504)
Keywords
text topic extraction
text similarity calculation
high frequency words
word vectors
software system