
Heuristic Chinese sentence compression algorithm based on hot word (基于词语热度的启发式中文句子压缩算法)

Cited by: 1
Abstract Traditional sentence compression methods mostly rely on parallel "original sentence/compressed sentence" corpora, which are difficult to obtain. This paper therefore proposes a Chinese sentence compression algorithm that does not depend on an aligned corpus. By studying human-produced compressions and drawing on linguistic knowledge, two sets of compression rules are proposed, one at the word level and one at the clause level. The algorithm applies both rule sets on top of the parse tree of the original sentence and the dependency relations between its words. To make the algorithm more adaptable and accurate, word heat is introduced to strengthen it. Finally, the output is cleaned up and grammatically repaired to yield the final compressed sentence. Three methods are compared: human compression, rule-only compression, and compression enhanced with word heat. The results are evaluated on compression rate, grammaticality, informativeness, and heat. The experiments show that the heat-based heuristic algorithm raises the heat of the compressed sentences with little loss in compression rate, grammaticality, or informativeness.
Authors Han Jing (韩静), Zhang Dongzhan (张东站)
Source Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2014, No. 4, pp. 132-139 (8 pages)
Funding National Natural Science Foundation of China (No. 50604012)
Keywords Chinese sentence compression; hot word; linguistics; parse tree
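
The abstract describes rule-based deletion that is then adjusted by word heat. The following is only a toy sketch of that idea, not the authors' method: it replaces the parse-tree and dependency-based rules with a flat part-of-speech rule, and the tag set, the `heat` dictionary, the `threshold` value, and the character-based compression rate are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical tags that the word-level rule treats as removable.
# (Assumption: the paper's real rules work on a parse tree and are not given here.)
REMOVABLE_TAGS = {"ADJ", "ADV"}


@dataclass
class Token:
    text: str
    tag: str


def compress(tokens, heat, threshold=0.5):
    """Drop rule-removable tokens unless their heat reaches the threshold."""
    kept = []
    for tok in tokens:
        removable = tok.tag in REMOVABLE_TAGS
        hot = heat.get(tok.text, 0.0) >= threshold
        if not removable or hot:
            kept.append(tok)
    return kept


def compression_rate(original, compressed):
    """Compressed length divided by original length, counted in characters."""
    orig_len = sum(len(t.text) for t in original)
    return sum(len(t.text) for t in compressed) / orig_len


# A pre-segmented, pre-tagged example sentence (hypothetical input).
sentence = [
    Token("著名", "ADJ"), Token("导演", "NOUN"), Token("悄悄", "ADV"),
    Token("发布", "VERB"), Token("全新", "ADJ"), Token("电影", "NOUN"),
]

rule_only = compress(sentence, heat={})
with_heat = compress(sentence, heat={"全新": 0.9})

print("".join(t.text for t in rule_only))   # 导演发布电影
print("".join(t.text for t in with_heat))   # 导演发布全新电影
print(compression_rate(sentence, rule_only))
```

In this sketch a hot modifier survives deletion, so the heat-enhanced output is slightly longer than the rule-only output. This mirrors the trade-off the abstract reports: higher heat at a small cost in compression rate.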











