Abstract
Web news grows exponentially and spreads rapidly, and news about emergencies is scattered across the Internet. Traditional text classification methods suffer from low accuracy and efficiency, which makes it difficult to find emergency news on a specific topic. To address these problems, this paper proposes a multiple-layer automatic classification method for Web emergency news that combines rules and statistics. First, category keywords are extracted to form a rule base. The classification rules then assign emergencies to four major categories, and a naive Bayes classifier subdivides the news within each major category, yielding a two-layer classification model based on rules and statistics. Experimental results show that both the accuracy and the recall of the method exceed 90%, and that its classification efficiency is generally higher than that of traditional classification methods.
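The two-layer idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the category names, keyword sets, training documents, and the helper names `classify_by_rules`, `NaiveBayes`, and `classify` are all hypothetical, and the input is assumed to be already tokenized (Chinese news text would first need word segmentation).

```python
import math
from collections import Counter, defaultdict

# Hypothetical rule base: major category -> trigger keywords.
# The paper uses four major categories; these names and keywords are assumptions.
RULES = {
    "natural_disaster": {"earthquake", "flood", "typhoon"},
    "accident": {"explosion", "collision", "collapse"},
    "public_health": {"outbreak", "virus", "poisoning"},
    "social_security": {"riot", "attack", "hostage"},
}


def classify_by_rules(tokens):
    """Layer 1: return the major category with the most keyword hits, or None."""
    scores = {cat: sum(t in kws for t in tokens) for cat, kws in RULES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


class NaiveBayes:
    """Layer 2: multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify(tokens, sub_models):
    """Apply the rules first, then the naive Bayes model of the matched category."""
    major = classify_by_rules(tokens)
    if major is None:
        return None, None
    model = sub_models.get(major)
    minor = model.predict(tokens) if model is not None else None
    return major, minor


# Toy training data for one major category only (purely illustrative).
sub_models = {
    "natural_disaster": NaiveBayes().fit(
        [["earthquake", "magnitude", "aftershock"], ["flood", "river", "rainfall"]],
        ["earthquake", "flood"],
    ),
}
major, minor = classify(["earthquake", "magnitude", "rescue"], sub_models)
```

Running the rules first narrows each naive Bayes model to the subcategories of a single major category, which is one plausible reason a layered scheme could be both faster and more accurate than a single flat classifier.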
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal (北大核心)
2012, No. 2, pp. 392-394, 415 (4 pages)
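Funding
National Natural Science Foundation of China (60873013, 61070119)
Open Project Fund of the Key Laboratory of Computational Linguistics (Peking University), Ministry of Education (KLCL-1005)
Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (PHR201007131)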
Keywords
rule
statistics
emergency news
multiple-layer classification