Abstract
The web holds a vast amount of Chinese text, much of it sparse and nonstandard, so keyword extraction methods that rely on counting word occurrences perform poorly on it. Baidu Baike (百度百科) is a rich, continually updated Chinese encyclopedia that closely tracks hot topics and popular web usage. This paper proposes a keyword extraction method for Chinese web text based on Baidu Baike. The method uses the knowledge relations in the encyclopedia to map a text from its set of surface entry terms (its extension) into a space of semantic topics that captures its meaning (its intension), then adjusts the topic weights according to the relations among the topics, and finally applies Naive Bayes to trace back from the topics to the keywords of the original text. Because it avoids statistics over an exhaustive enumeration of terms, the method can to a large extent handle web slang and newly coined words, which existing text mining methods fail to extract. Experiments on two datasets show that the method performs well and stably on both standard and nonstandard text.
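The three stages named in the abstract (term-to-topic mapping, relation-based weight adjustment, and a Bayes-style trace back to keywords) can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the toy dictionaries `TERM_TOPICS` and `RELATED` stand in for encyclopedia knowledge, the parameter `alpha` and the function names are hypothetical, and the final scoring is a simple topic-mixture score rather than the paper's Naive Bayes procedure.

```python
from collections import Counter, defaultdict

# Hypothetical toy knowledge base: entry term -> topics it belongs to.
# A real system would derive this from encyclopedia links or categories.
TERM_TOPICS = {
    "iphone": ["smartphone", "apple"],
    "ios": ["operating system", "apple"],
    "android": ["operating system", "smartphone"],
    "banana": ["fruit"],
}

# Hypothetical topic-to-topic relations used to reinforce related topics.
RELATED = {
    "smartphone": ["operating system"],
    "operating system": ["smartphone"],
    "apple": ["smartphone"],
}


def topic_weights(terms):
    """Stage 1: map terms to topics, splitting each term's count evenly."""
    weights = defaultdict(float)
    for term, count in Counter(terms).items():
        topics = TERM_TOPICS.get(term, [])  # unknown terms map to no topic
        for topic in topics:
            weights[topic] += count / len(topics)
    return dict(weights)


def adjust(weights, alpha=0.5):
    """Stage 2: boost each topic by its related topics, then normalize."""
    adjusted = dict(weights)
    for topic, neighbours in RELATED.items():
        if topic not in weights:
            continue
        for neighbour in neighbours:
            # Read from the unadjusted weights so the update order is irrelevant.
            adjusted[topic] += alpha * weights.get(neighbour, 0.0)
    total = sum(adjusted.values())
    return {t: w / total for t, w in adjusted.items()} if total else {}


def rank_keywords(terms, k=3):
    """Stage 3: score each term by sum over topics of P(topic) * P(term | topic)."""
    prior = adjust(topic_weights(terms))
    # Assumed likelihood: uniform over the terms listed under a topic.
    topic_size = Counter(t for ts in TERM_TOPICS.values() for t in ts)
    scores = {}
    for term in set(terms):
        topics = TERM_TOPICS.get(term)
        if not topics:
            continue
        scores[term] = sum(prior.get(t, 0.0) / topic_size[t] for t in topics)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


if __name__ == "__main__":
    text_terms = ["iphone", "ios", "android", "banana", "iphone"]
    print(rank_keywords(text_terms, k=2))
```

In this sketch a term never needs to be frequent in the text to rank highly; it only needs to belong to topics that the text as a whole supports, which is the intuition the abstract gives for handling sparse text.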
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
Indexed in: CSCD; PKU Core Journals (北大核心)
2014, Issue 11, pp. 2422-2427 (6 pages)
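Funding
Supported by the National Natural Science Foundation of China (61202298)
Supported by the Natural Science Foundation of Fujian Province (2012J05117)
Supported by the Fundamental Research Funds for the Central Universities (JB-ZR1217)
Supported by the Xiamen Science and Technology Program (3502Z20133029)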