Abstract
Automatic document summarization is an important means of obtaining key information from microblog platforms. Most existing microblog summarization methods focus on extracting sentences or key phrases from a document set and lack effective means of removing redundant information and content noise, so the extracted microblog content is of low quality, which directly degrades summary performance. To address this problem, this paper takes the microblog platform as its research object and proposes an information extraction method based on time-frequency transformation that obtains high-quality microblogs that are highly relevant to a given topic, low in redundancy, and rich in information. The microblogs with the highest composite scores form the sample set for summary generation; each sentence in this set is assigned a weight score, and the highest-weighted sentences are selected to compose the summary. Experimental results show that the method effectively filters redundant information and content noise, and that its summaries outperform those of existing automatic summarization methods under both automatic and manual evaluation.
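The two-stage pipeline described in the abstract (select high-scoring microblogs, then weight and select sentences) can be illustrated with a minimal Python sketch. The abstract does not specify the time-frequency transformation or the scoring formulas, so this sketch substitutes simple token-overlap scores for relevance, redundancy, and informativeness; it is not the authors' method. All names (`tokenize`, `select_posts`, `summarize`, `redundancy_penalty`) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real word segmenter."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Jaccard overlap between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def select_posts(posts, topic, k=2, redundancy_penalty=1.0):
    """Stage 1 (assumed scoring): greedily pick k posts that are relevant to
    the topic and informative, penalizing overlap with posts already chosen."""
    topic_set = set(tokenize(topic))
    tokens = [tokenize(p) for p in posts]
    max_len = max((len(set(t)) for t in tokens), default=0) or 1

    chosen = []
    remaining = list(range(len(posts)))
    while remaining and len(chosen) < k:
        def score(i):
            # Share of topic terms that appear in the post.
            relevance = len(set(tokens[i]) & topic_set) / len(topic_set) if topic_set else 0.0
            # Vocabulary size relative to the richest post.
            informativeness = len(set(tokens[i])) / max_len
            # Highest overlap with any post selected so far.
            redundancy = max((jaccard(tokens[i], tokens[j]) for j in chosen), default=0.0)
            return relevance + informativeness - redundancy_penalty * redundancy

        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return [posts[i] for i in chosen]


def summarize(posts, topic, k=2, n_sentences=2):
    """Stage 2 (assumed weighting): score each sentence of the selected posts
    by mean term frequency and keep the highest-weighted sentences."""
    selected = select_posts(posts, topic, k)
    freq = Counter(tok for p in selected for tok in tokenize(p))
    sentences = [
        s.strip()
        for p in selected
        for s in re.split(r"[.!?。!?]+", p)
        if s.strip()
    ]

    def weight(sentence):
        toks = tokenize(sentence)
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    return sorted(sentences, key=weight, reverse=True)[:n_sentences]
```

Calling `summarize(posts, topic)` with a list of microblog strings and a topic string returns the top-weighted sentences. The greedy redundancy penalty in stage 1 is one simple way to keep near-duplicate posts out of the sample set; a faithful implementation would replace the three scores with those defined in the paper.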
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journals (北大核心)
2015, No. 7, pp. 36-42 (7 pages)
Funding
National Natural Science Foundation of China (61070083)
Shenzhen Knowledge Innovation Program (2013)
Keywords
microblog automatic summarization
redundancy removal
information extraction
automatic evaluation
manual evaluation