Abstract
Automatic extraction of phrase-level paraphrases is an important research topic in natural language processing (NLP) and has been widely applied to tasks such as information retrieval, question answering, and document classification. Patents, as carriers of human knowledge and technology, are rich in content, so automatically extracting phrase-level paraphrases from Chinese-English parallel patent corpora can benefit NLP tasks on technical topics. In this paper, we extract phrase-level paraphrases from Chinese-English parallel patents with a method based on statistical machine translation and filter the extracted results with chunk parsing. In addition, to handle extraction errors caused by word alignment errors and translation ambiguity, we re-rank the extracted paraphrases by distributional similarity. Experiments show that, on the Top-500 results, the statistical machine translation method reaches a precision of 43.20% for Chinese and 43.60% for English; after chunk-based filtering, precision rises to 75.50% and 52.40%, respectively. Re-ranking by distributional similarity also effectively improves the extraction results.
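The abstract does not give the scoring formulas, so the following is only a minimal sketch of how such a pipeline is commonly set up, not the authors' implementation. It assumes the widely used pivot formulation, in which a paraphrase score for phrases e1 and e2 sums p(f|e1) * p(e2|f) over shared translations f, and it assumes that re-ranking interpolates this score with the cosine similarity of context-count vectors. The function names, the `weight` parameter, and the toy probabilities are all hypothetical; the chunk-based filtering step is omitted.

```python
from collections import defaultdict
from math import sqrt


def pivot_paraphrases(p_f_given_e, p_e_given_f):
    """Score paraphrase candidates by pivoting through translations.

    p_f_given_e: {phrase: {pivot: p(pivot | phrase)}}
    p_e_given_f: {pivot: {phrase: p(phrase | pivot)}}
    Returns {phrase: {candidate: score}} (assumed pivot formulation).
    """
    scores = {}
    for e1, pivots in p_f_given_e.items():
        acc = defaultdict(float)
        for f, p_f in pivots.items():
            for e2, p_e2 in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # a phrase is not its own paraphrase
                    acc[e2] += p_f * p_e2
        scores[e1] = dict(acc)
    return scores


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(phrase, candidates, context_vectors, weight=0.5):
    """Sort candidates by a mix of pivot score and distributional similarity.

    weight is a hypothetical interpolation parameter, not from the paper.
    """
    base = context_vectors.get(phrase, {})

    def combined(item):
        cand, pivot_score = item
        sim = cosine(base, context_vectors.get(cand, {}))
        return (1 - weight) * pivot_score + weight * sim

    ranked = sorted(candidates.items(), key=combined, reverse=True)
    return [cand for cand, _ in ranked]


# Toy data, invented for illustration only.
p_f_given_e = {"cell phone": {"手机": 0.8, "电话": 0.2}}
p_e_given_f = {
    "手机": {"cell phone": 0.5, "mobile phone": 0.5},
    "电话": {"telephone": 0.7, "cell phone": 0.3},
}
toy_scores = pivot_paraphrases(p_f_given_e, p_e_given_f)
```

In this toy setting, a candidate that shares a noisy pivot but appears in different contexts receives a low cosine similarity and drops in the ranking, which is the kind of alignment and ambiguity error the re-ranking step is meant to suppress.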
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2013, No. 6, pp. 151-157, 174 (8 pages)
Journal of Chinese Information Processing
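Funding
National Natural Science Foundation of China (61133012)
National High Technology Research and Development Program of China (863 Program) (2012AA011102)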