
A method of extracting subject words based on improved TF-IDF algorithm and co-occurrence words
Cited by: 17
Abstract: Extracting the topics of information is a basic task for quickly locating user needs. Subject-word extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, the method first uses mutual information to identify co-occurring word pairs and combines it nonlinearly with word frequency, part of speech, and word position. It then builds a document-co-occurrence-word matrix from the word weights and establishes a Latent Semantic Analysis (LSA) model. Using the Singular Value Decomposition (SVD) of the LSA model, the method maps the document-co-occurrence-word matrix into the latent semantic space, which both reduces the data dimensionality and yields a low-dimensional document similarity matrix. Finally, k-means clustering is applied to the document similarity matrix, and within each cluster of documents the few co-occurring word pairs with the largest word weights are selected as the subject words for that class of articles. Compared with experiments that extract subject words based on TF-IDF (Term Frequency-Inverse Document Frequency) and on co-occurrence words, the algorithm improves accuracy by 19% and 10%, respectively.
Source: Journal of Nanjing University (Natural Science), CAS, CSCD, Peking University Core Journals, 2017, No. 6, pp. 1072-1080 (9 pages)
Funding: Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042); Key Project of Teaching Reform Research in Undergraduate Universities of Shandong Province (2015Z058)
Keywords: co-occurrence words; mutual information; Latent Semantic Analysis (LSA); Singular Value Decomposition (SVD); Term Frequency-Inverse Document Frequency (TF-IDF)
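
The pipeline outlined in the abstract (co-occurring pairs from mutual information, a weighted document-by-pair matrix, SVD, k-means, then the top-weighted pairs per cluster) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the toy corpus, the use of document-level pointwise mutual information, the placeholder weight `pmi * log(1 + min term frequency)` (the paper's nonlinear combination with part of speech and position is not given in the abstract), and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

import numpy as np

# Toy, pre-tokenized corpus (assumption: the paper works on segmented Chinese text).
docs = [
    ["stock", "market", "price", "stock", "market"],
    ["market", "price", "stock", "trade"],
    ["football", "match", "goal", "football", "match", "goal"],
    ["match", "goal", "football", "team"],
]

def cooccurrence_pmi(docs, threshold=0.0):
    """Keep word pairs whose document-level pointwise mutual information exceeds threshold."""
    n = len(docs)
    word_df = Counter(w for d in docs for w in set(d))
    pair_df = Counter(p for d in docs for p in combinations(sorted(set(d)), 2))
    pmi = {}
    for (a, b), c in pair_df.items():
        score = math.log((c / n) / ((word_df[a] / n) * (word_df[b] / n)))
        if score > threshold:
            pmi[(a, b)] = score
    return pmi

def build_matrix(docs, pmi):
    """Document-by-pair weight matrix; the weight formula is a placeholder assumption."""
    pairs = sorted(pmi)
    X = np.zeros((len(docs), len(pairs)))
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        for j, (a, b) in enumerate(pairs):
            if a in tf and b in tf:
                X[i, j] = pmi[(a, b)] * math.log1p(min(tf[a], tf[b]))
    return pairs, X

def doc_similarity(X, rank):
    """Project documents onto the top `rank` singular directions, then take cosine similarity."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    vecs = U[:, :rank] * s[:rank]
    vecs = vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12)
    return vecs @ vecs.T

def kmeans(points, k, iters=20):
    """Plain Lloyd iterations with deterministic farthest-first initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

pmi = cooccurrence_pmi(docs)
pairs, X = build_matrix(docs, pmi)
labels = kmeans(doc_similarity(X, rank=2), k=2)

# For each cluster, rank pairs by their summed weight over the cluster's documents.
topics = {}
for c in sorted(set(labels.tolist())):
    totals = X[labels == c].sum(axis=0)
    topics[c] = [pairs[j] for j in np.argsort(-totals)[:2]]
print(labels, topics)
```

On this toy corpus the two finance documents and the two sports documents fall into separate clusters, and each cluster's subject words are drawn from its own vocabulary. With real data, the rank of the SVD and the number of clusters would need to be chosen empirically.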
