Abstract
Blogs have become one of the most popular applications on the Internet. Blog posts vary widely in content, so classifying them is of considerable value. Because blog posts differ from news articles, applying common text classification methods to them directly gives unsatisfactory results. This paper proposes a new method that makes full use of several features specific to blog posts, such as tags and user-defined categories, and fuses these features. In addition, user-defined categories are preprocessed to filter out noise words unrelated to the category. Experimental results show that the multi-feature fusion method achieves clearly better precision on blog posts than common text classification methods.
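The abstract does not say how the features are fused, so the following is only a minimal sketch of one possible reading: each field (body text, tags, user-defined category) is scored against every class separately, and the scores are combined as a weighted sum. The fusion rule, the weights, the per-class vocabularies, the noise-word list, and all function names are assumptions made for illustration. They are not the authors' method.

```python
from collections import Counter

# Hypothetical noise words for user-defined category names (an assumption).
CATEGORY_NOISE = {"my", "the", "of", "and", "misc", "notes"}


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def clean_category(text):
    """Preprocess a user-defined category by dropping noise words."""
    return [w for w in tokenize(text) if w not in CATEGORY_NOISE]


def field_scores(tokens, class_vocab):
    """Score each class by the fraction of tokens found in its vocabulary."""
    if not tokens:
        return {c: 0.0 for c in class_vocab}
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        c: sum(n for w, n in counts.items() if w in vocab) / total
        for c, vocab in class_vocab.items()
    }


def classify(post, class_vocab, weights):
    """Fuse per-field scores with a weighted sum and return the best class."""
    fields = {
        "body": tokenize(post["body"]),
        "tags": [t.lower() for t in post["tags"]],
        "category": clean_category(post["category"]),
    }
    fused = {c: 0.0 for c in class_vocab}
    for name, tokens in fields.items():
        for c, score in field_scores(tokens, class_vocab).items():
            fused[c] += weights[name] * score
    return max(fused, key=fused.get), fused


# Toy data, invented for this example only.
class_vocab = {
    "sports": {"football", "match", "goal", "league"},
    "tech": {"python", "code", "software", "linux"},
}
weights = {"body": 0.4, "tags": 0.3, "category": 0.3}
post = {
    "body": "a late goal decided the match",
    "tags": ["football"],
    "category": "my football notes",
}
label, fused = classify(post, class_vocab, weights)
print(label, fused)
```

In this toy case the tag and the cleaned category both point at the same class, so they reinforce a body score that would be weak on its own. A real system would replace the vocabulary lookup with a trained classifier per field and tune the weights on held-out data.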
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal (北大核心)
2010, No. 6, pp. 1129-1132 (4 pages)
Journal of Chinese Computer Systems
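Funding
Supported by the National Natural Science Foundation of China (60672056)
Supported by the National High Technology Research and Development Program of China ("863" Program) (2008AA01Z117)
Supported by the Specialized Research Fund for the Doctoral Program of Higher Education (20070358040)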
Keywords
text classification
blog post classification
blog post feature
multi-feature fusion