Abstract
A blog is an online personal journal. As an emerging medium, the blog has become a very popular way to express personal opinions and emotions on the Web. Extracting useful information from blogs quickly and accurately (topic posting time, topic title, topic content, comments, etc.) has therefore become an important step in blog applications. This paper presents a template-based approach to blog information extraction. The approach analyzes the HTML source code of a blog web site to derive the site's template, and then extracts information from that site's blog pages according to the template. Templates were extracted for 10 well-known Chinese blog web sites, and experiments on 7,374 blog pages from these 10 sites show that the approach can extract information from blog pages quickly and accurately according to the extracted templates.
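The abstract does not say how a site template is represented, so the following is only a minimal sketch of template-driven extraction under stated assumptions. It assumes a template can be approximated as a pair of literal HTML markers per field; the `TEMPLATE` dictionary, the class names in it, the `extract` function, and the sample `PAGE` are all hypothetical and are not taken from the paper.

```python
import re
from html import unescape

# Hypothetical template for one imaginary blog site: each field maps to the
# literal HTML that precedes and follows it on every page of that site.
TEMPLATE = {
    "title": ('<h2 class="post-title">', "</h2>"),
    "posted": ('<span class="post-date">', "</span>"),
    "content": ('<div class="post-body">', "</div>"),
    "comments": ('<li class="comment">', "</li>"),
}

# Matches any HTML tag so that inner markup can be stripped from a field.
TAG = re.compile(r"<[^>]+>")


def extract(html, template):
    """Return, for each field, the list of all texts found between its markers."""
    record = {}
    for field, (start, end) in template.items():
        # Non-greedy match between the two markers; DOTALL lets it span lines.
        pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
        record[field] = [
            unescape(TAG.sub("", match)).strip() for match in pattern.findall(html)
        ]
    return record


# Made-up page that follows the hypothetical template above.
PAGE = """
<h2 class="post-title">Hello &amp; welcome</h2>
<span class="post-date">2008-03-01</span>
<div class="post-body"><p>First post.</p></div>
<ul><li class="comment">Nice!</li><li class="comment">Agreed.</li></ul>
"""

record = extract(PAGE, TEMPLATE)
```

Because the match is non-greedy, a field whose body nests the same closing tag (for example a `<div>` inside the post body) would be cut short; a real system would more plausibly work on a parsed DOM than on raw marker strings.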
Source
《计算机工程与应用》
CSCD
PKU Core Journal (北大核心)
2008, No. 9, pp. 156-158, 177 (4 pages)
Computer Engineering and Applications
Funding
National Natural Science Foundation of China (Grant Nos. 60373095 and 60673039)
National High-Tech Research and Development Plan of China (863 Program) (Grant No. 2006AA01Z151)
Keywords
Blog
information extraction
template