
基于中英平行专利语料的短语复述自动抽取研究 (cited by: 7)

Automatically Extracting Phrase-level Paraphrases from Chinese-English Parallel Patents
Abstract Automatic extraction of phrase-level paraphrases is an important research topic in natural language processing (NLP) and has been widely applied to tasks such as information retrieval, question answering, and document classification. Patents, as carriers of human knowledge and technology, are rich in content, so extracting phrase-level paraphrases automatically from a Chinese-English parallel patent corpus can benefit NLP tasks on technical topics. In this paper, we extract phrase-level paraphrases from Chinese-English parallel patents using a method based on statistical machine translation (SMT), and filter the extracted candidates with a technique based on chunk parsing. In addition, to handle extraction errors caused by word-alignment errors and translation ambiguity, we re-rank the extracted paraphrases by distributional similarity. Experiments show that, for the Top-500 results, the SMT-based method achieves a precision of 43.20% on Chinese and 43.60% on English; after chunk-based filtering, precision rises to 75.50% and 52.40%, respectively. Re-ranking by distributional similarity also effectively improves the extraction results.
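The abstract does not spell out the SMT-based extraction or the re-ranking formula. As a rough illustration only, the sketch below assumes the widely used bilingual pivot formulation, in which a candidate e2 for a phrase e1 is scored as the sum over pivot phrases f of p(f|e1) * p(e2|f), followed by a linear interpolation with the cosine similarity of context vectors. The function names, the toy phrase tables, the context counts, and the weight `alpha` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def pivot_paraphrases(p_f_given_e, p_e_given_f, source):
    """Score each candidate e2 as sum over pivots f of p(f|source) * p(e2|f).

    Assumed pivot formulation; the paper's exact scoring is not given here.
    """
    scores = defaultdict(float)
    for pivot, p_fe in p_f_given_e.get(source, {}).items():
        for cand, p_ef in p_e_given_f.get(pivot, {}).items():
            if cand != source:  # a phrase is not its own paraphrase
                scores[cand] += p_fe * p_ef
    return dict(scores)


def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(source, scores, contexts, alpha=0.5):
    """Order candidates by alpha * pivot score + (1 - alpha) * cosine.

    The interpolation and alpha are illustrative assumptions.
    """
    src_ctx = contexts.get(source, {})
    combined = {
        cand: alpha * s + (1 - alpha) * cosine(src_ctx, contexts.get(cand, {}))
        for cand, s in scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Toy phrase-translation tables (hypothetical probabilities).
p_f_given_e = {"data storage device": {"数据存储装置": 0.6, "存储器": 0.4}}
p_e_given_f = {
    "数据存储装置": {"data storage device": 0.7, "data storage apparatus": 0.3},
    "存储器": {"data storage device": 0.2, "memory": 0.5, "the memory of": 0.3},
}
# Toy context-word counts for each English phrase (hypothetical).
contexts = {
    "data storage device": {"the": 2, "stores": 3, "comprises": 1},
    "data storage apparatus": {"the": 2, "stores": 2, "comprises": 1},
    "memory": {"the": 2, "of": 3, "stores": 1},
    "the memory of": {"in": 2, "is": 1},
}

source = "data storage device"
scores = pivot_paraphrases(p_f_given_e, p_e_given_f, source)
pivot_order = sorted(scores, key=scores.get, reverse=True)
final_order = rerank(source, scores, contexts)
print("pivot only:", pivot_order)
print("re-ranked: ", final_order)
```

In this toy data the pivot score alone ranks "memory" first because an ambiguous pivot carries a lot of probability mass, while the context-based re-ranking moves "data storage apparatus" to the top. The chunk-based filtering step described in the abstract is not covered by this sketch.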
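Source Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 6, pp. 151-157, 174 (8 pages)
Funding National Natural Science Foundation of China (61133012); National 863 Program of China (2012AA011102)
Keywords automatic extraction; phrase; corpus; patent; parallel; statistical machine translation; natural language processing; extraction technique; phrase-level paraphrase; chunk parsing; distributional similarity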
