Abstract
TF-IDF text feature extraction, which relies on word frequency statistics, does not account for relations between the concepts in a text, so the extracted features suffer from concept redundancy and poorly defined features. To address this, a word frequency statistics method based on ontology concept similarity is proposed. The semantic similarity between text elements is used to adjust the word frequency of feature elements, which highlights the semantic contribution of each feature element, eliminates feature redundancy, and strengthens the independence of the elements in the feature set. Finally, the co-occurrence characteristics of text concepts are used to handle the case where important feature elements may be overlooked because of their word frequency, so that text features are extracted accurately and efficiently.
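The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea of adjusting word frequency by concept similarity. Everything in it is an assumption made for illustration: the `TOY_SIMILARITY` table stands in for a real ontology lookup, the function names `similarity` and `adjusted_frequencies` are hypothetical, and the merging rule and the 0.8 threshold are placeholder choices, not the authors' method. The co-occurrence step is not covered.

```python
from collections import Counter

# Hypothetical concept-similarity table standing in for an ontology lookup.
TOY_SIMILARITY = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "engine")): 0.4,
}


def similarity(a, b):
    """Return a similarity in [0, 1]; identical terms score 1.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def adjusted_frequencies(tokens, threshold=0.8):
    """Fold near-synonymous terms into one representative term.

    Assumed rule (not from the paper): a term whose similarity to an
    existing representative reaches the threshold adds its count,
    scaled by that similarity, to the representative's frequency.
    """
    counts = Counter(tokens)
    adjusted = {}
    # Visit terms from most to least frequent so that the most common
    # term in a group of similar concepts becomes its representative.
    for term, count in counts.most_common():
        best = max(adjusted, key=lambda rep: similarity(term, rep), default=None)
        if best is not None and similarity(term, best) >= threshold:
            adjusted[best] += similarity(term, best) * count
        else:
            adjusted[term] = float(count)
    return adjusted


tokens = ["car", "automobile", "car", "engine", "automobile", "car"]
result = adjusted_frequencies(tokens)
```

In this toy input, "automobile" is folded into "car", which removes the redundant concept and raises the frequency of the remaining one, while "engine" stays a separate feature because its similarity to "car" is below the threshold.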
Source
《计算机与数字工程》
2014, Issue 11, pp. 2066-2068, 2163 (4 pages in total)
Computer & Digital Engineering
Funding
Supported by the National Science and Technology Major Project (No. 2011ZX05023-005-012)
Keywords
text feature, word frequency statistics, ontology concept similarity, co-occurrence features