Abstract
Topic extraction is a fundamental task for quickly locating user needs. Keyword extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, mutual information is first used to determine co-occurrence word pairs, and it is then combined non-linearly with word frequency, part of speech, and word position information. From these weights a document-co-occurrence-word matrix is constructed and a Latent Semantic Analysis (LSA) model is established. Using the Singular Value Decomposition (SVD) of the LSA model, the document-co-occurrence-word matrix is mapped into a latent semantic space, which not only reduces the dimensionality of the data but also yields a low-dimensional document similarity matrix. Finally, k-means clustering is applied to the document similarity matrix, and within each cluster the few co-occurrence word pairs with the largest word weights are selected as the topic words of that document class. Compared with experiments extracting topic words based on TF-IDF (Term Frequency-Inverse Document Frequency) and on co-occurrence words, the proposed algorithm improves accuracy by 19% and 10%, respectively.
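The pipeline the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the weight matrix below is a hand-made toy (the paper's non-linear combination of mutual information, frequency, part of speech, and position is not reproduced), and the deterministic k-means initialisation is an assumption made for reproducibility.

```python
import numpy as np

# Toy document x co-occurrence-pair weight matrix (rows: documents,
# columns: co-occurrence word pairs). The values are made up; in the
# paper each weight is a non-linear combination of mutual information,
# word frequency, part of speech and word position.
A = np.array([
    [3.0, 2.5, 0.1, 0.0],
    [2.8, 2.0, 0.0, 0.2],
    [0.0, 0.1, 3.2, 2.9],
    [0.2, 0.0, 2.7, 3.1],
])

# LSA step: truncated SVD maps each document into a k-dimensional
# latent semantic space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_latent = U[:, :k] * s[:k]

# Low-dimensional document similarity matrix (cosine similarity).
unit = docs_latent / np.linalg.norm(docs_latent, axis=1, keepdims=True)
sim = unit @ unit.T

# Minimal k-means on the latent coordinates (2 clusters, fixed
# deterministic initialisation for reproducibility).
centers = docs_latent[[0, 2]].copy()
for _ in range(10):
    dists = ((docs_latent[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    centers = np.array([docs_latent[labels == c].mean(axis=0)
                        for c in range(2)])

# Within each cluster, select the co-occurrence pair with the largest
# summed weight as that cluster's topic indicator.
top_pairs = [int(A[labels == c].sum(axis=0).argmax()) for c in range(2)]
print("cluster labels:", labels)
print("top pair per cluster:", top_pairs)
```

On this toy matrix the first two documents and the last two documents fall into separate clusters, and each cluster's strongest co-occurrence pair serves as its topic indicator.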
Source
《南京大学学报(自然科学版)》 (Journal of Nanjing University (Natural Science)), 2017, No. 6, pp. 1072-1080 (9 pages)
Indexed in: CAS, CSCD, Peking University Core (北大核心)
Funding
Humanities and Social Sciences Research Project of the Ministry of Education of China (15YJAZH042)
Key Teaching Reform Research Project for Undergraduate Universities of Shandong Province (2015Z058)