Abstract
Automatic summarization is an important research topic in natural language processing. This paper proposes a method for Chinese automatic summarization based on thematic area discovery. Its distinguishing feature is that the resulting summary covers as many of the text's themes as possible while substantially reducing its own redundancy, thereby balancing these two competing goals. The method combines the k-medoids clustering algorithm with a cluster analysis procedure based on a new self-defined objective function, so that paragraphs are clustered adaptively, the latent thematic areas of a text are discovered, and those areas are then used for summarization. In addition, a new evaluation factor based on representation entropy is used to assess summary redundancy. Experimental results confirm that the method is feasible and effective, making it a meaningful exploration in Chinese automatic summarization research.
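As a rough illustration of the clustering step named in the abstract, the sketch below groups paragraphs with a plain k-medoids loop over cosine distances between term-frequency vectors. It is not the paper's method: the abstract does not specify the self-defined objective function or how the number of clusters is adapted, so the sketch assumes a fixed k, the standard sum-of-distances criterion, and paragraphs that are already word-segmented. The names cosine_distance, assign and k_medoids are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    # Clamp at 0 to absorb floating-point rounding for identical vectors.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def assign(medoids, dist):
    """Attach every paragraph index to its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(vectors, k, max_iter=100):
    """Plain k-medoids: alternate assignment and medoid update until stable."""
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
    medoids = list(range(k))  # assumption: start from the first k paragraphs
    for _ in range(max_iter):
        clusters = assign(medoids, dist)
        # New medoid = member with the smallest total distance to its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids, dist)


# Toy input: four pre-segmented "paragraphs" on two themes (made-up data).
paragraphs = [
    ["stock", "market", "price"],
    ["football", "match", "goal"],
    ["market", "price", "trade"],
    ["goal", "football", "team"],
]
vectors = [Counter(p) for p in paragraphs]
medoids, clusters = k_medoids(vectors, k=2)
```

In this toy run each cluster stands for one candidate thematic area, and its medoid paragraph is a natural place to draw summary sentences from. Real Chinese text would first need word segmentation, and the initial medoids would normally be chosen less naively than taking the first k paragraphs.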
Source
《计算机科学》
CSCD
PKU Core Journal (北大核心)
2005, Issue 1, pp. 177-181 (5 pages)
Computer Science
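Funding
State Language Commission of China, "Tenth Five-Year Plan" Applied Research Project Fund (ZDI105-43B)
Natural Science Foundation of Hubei Province (2001ABB012)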
Keywords
Thematic area discovery
Chinese automatic summarization
Clustering analysis
Representation entropy
Text retrieval