Abstract
To identify abbreviations and their corresponding definitions in large volumes of English academic text, this paper proposes an automatic abbreviation identification algorithm called MELearn-AI. Treating the task as sequence labelling, MELearn-AI trains a maximum entropy model on a manually labelled dataset and then uses that model to identify abbreviations in English academic texts in computer science. On the "Paren-sen" evaluation dataset constructed in this paper, MELearn-AI achieves a precision of 95.8% and a recall of 86.3%, a clear improvement over the two comparison experiments. The proposed method achieves satisfactory results on computer science academic texts and helps readers better understand and use the terminology of the field.
Authors
张秋子
陆伟
程齐凯
黄永
ZHANG Qiuzi, LU Wei, CHENG Qikai, HUANG Yong (Center for the Studies of Information Resources of Wuhan University, Wuhan 430072, China)
Source
Technology Intelligence Engineering (《情报工程》)
2015, No. 2, pp. 64-72 (9 pages)
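Funding
Supported by the National Natural Science Foundation of China project "Research on Modeling and Framework Implementation of Generic Entity Retrieval Based on Language Models" (Project No. 71173164)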
Keywords
Academic texts
Abbreviations/acronyms
Machine learning
Sequence labelling
Information extraction