Abstract
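This paper implements a multilingual, multi-document news summarization system based on a hybrid method combining "anaphora resolution", "shallow text characteristics", and "graph-based ranking". The system consists of four modules: source-text preprocessing; summary computation based on shallow text characteristics; summary computation based on graph-based ranking; and summary generation. First, the news text is preprocessed and anaphora resolution is performed. In the second module, shallow text characteristics (such as topic words, cue words, keywords and their synonyms, time, place, and named entities) are used to compute the importance of each sentence in the source text and its relevance to the news topic, and each sentence is scored for the first time. In the third module, a graph-based ranking algorithm computes the degree of coherence between sentences in the source text, and each sentence is scored a second time. Finally, the fourth module computes a total score for each sentence from the scores of the two algorithms and extracts the summary of the news text in the order in which the sentences appear in the source. The system is implemented and tested on Chinese, English, and Bengali. Experiments show that the system can effectively extract summaries from multi-document news text in the three languages.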
This paper implements a multilingual, multi-document news summarization system based on a hybrid method that combines "pronoun resolution", "shallow document characteristics", and an "iterative graph-based algorithm". The system consists of four major modules: 1) document preprocessing, 2) summarization based on shallow document characteristics, 3) graph-based ranking, and 4) final summary generation. First, the system applies text preprocessing and pronoun resolution to the multi-document news text. Then a summarization method based on shallow document characteristics (e.g., topic words, cue phrases, keywords and their synonyms, named entities, time, and place) ranks all the sentences in the input document cluster for the first time. Next, the system ranks the sentences in the source documents again using an iterative graph-based ranking algorithm. Finally, the total score of each sentence is computed from the scores given by these two methods, and the final summary is generated following the order in which the sentences appear in the original text. The system has been tested on multi-document news text in English, Chinese, and Bengali. The test results show that the system can effectively extract news summaries in these languages.
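The two-pass scoring described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything concrete here is an assumption made for illustration: the feature score is reduced to a topic-word ratio, the graph edges use bag-of-words cosine similarity, the iteration is a PageRank-style update with damping 0.85, and the two scores are combined with a hypothetical weight `alpha`. Preprocessing and pronoun resolution are omitted, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Naive word tokenizer (assumption); Chinese or Bengali text would need
    # language-specific segmentation instead.
    return re.findall(r"\w+", sentence.lower())


def feature_scores(sentences, topic_words):
    # First pass (stand-in): share of a sentence's tokens that are topic words.
    scores = []
    for s in sentences:
        tokens = tokenize(s)
        hits = sum(1 for t in tokens if t in topic_words)
        scores.append(hits / len(tokens) if tokens else 0.0)
    return scores


def cosine(a, b):
    # Cosine similarity between two token lists treated as bags of words.
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def graph_scores(sentences, damping=0.85, iterations=50):
    # Second pass (stand-in): PageRank-style iteration on a similarity graph.
    n = len(sentences)
    if n == 0:
        return []
    toks = [tokenize(s) for s in sentences]
    w = [[cosine(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) for row in w]  # total edge weight leaving each sentence
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Sentences with no neighbours pass on no score (simplification).
        scores = [
            (1 - damping) / n
            + damping * sum(w[j][i] / out[j] * scores[j]
                            for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


def normalize(values):
    # Scale to [0, 1] so the two passes are comparable before mixing.
    top = max(values, default=0.0)
    return [v / top if top > 0 else 0.0 for v in values]


def summarize(sentences, topic_words, k=2, alpha=0.5):
    # Combine both scores, keep the k best, and emit them in source order.
    f = normalize(feature_scores(sentences, topic_words))
    g = normalize(graph_scores(sentences))
    total = [alpha * a + (1 - alpha) * b for a, b in zip(f, g)]
    top = sorted(range(len(sentences)), key=lambda i: total[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(sentences, topic_words, k=2)` returns the two highest-scoring sentences in the order they appear in the input, which mirrors the final module's behaviour of emitting summary sentences in their original sequence.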
Source
《图书馆学研究》
CSSCI
Peking University Core Journal
2013, No. 2, pp. 66-72 (7 pages)
Research on Library Science
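Funding
One of the research outcomes of the Institute of Scientific and Technical Information of China discipline development project "Natural Language Processing" (Project No. XK2012-6)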
Keywords
multilingual automatic text summarization
multi-document automatic text summarization
iterative graph-based ranking algorithm
shallow text characteristics