Abstract
Compression technology aims to simulate the human ability to summarize text and extract information. Sentence compression automatically generates a shorter sentence that preserves the core content of the original while remaining grammatical and semantically coherent. This paper analyzes Hedge Trimmer, a syntactic-parsing-based technique for English sentence compression, discusses the underlying compression theory, and walks through the compression process with an implementation of the algorithm in a C-like language. It proposes that a good compressed sentence should meet at least three criteria: first, it retains the core content of the original sentence; second, it is grammatical; and third, its length is reasonable. For the evaluation, 624 original sentences and their corresponding human-written compressions were selected from the DUC 2003 corpus and compared against the compressions generated automatically by the Hedge Trimmer algorithm. The comparison revealed five situations in which the automatic compressions were unsatisfactory; the causes are analyzed and improvement strategies are proposed. Finally, a comparison on examples of the compressions produced by the improved algorithm and by the original one shows that the improved algorithm yields better compressed sentences. The improved Hedge Trimmer algorithm is worth promoting and applying in English sentence compression.
Source
Journal of Shenyang Normal University: Natural Science Edition
CAS
2012, No. 4, pp. 519-524 (6 pages)
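Funding
National Natural Science Foundation of China (71002015)
Scientific Research Project for Higher Education Institutions, Education Department of Liaoning Province (2009B066)
Liaoning Higher Education Society "Twelfth Five-Year Plan" Higher Education Research Project (GHYB110231)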