Abstract
To extract information from text, four statistical modeling approaches were analyzed and compared, and conditional random fields (CRF) were chosen to build the extraction model. In the proposed method, the text is analyzed and labeled, a feature set is determined, and the CRF model parameters are estimated with L-BFGS, a limited-memory quasi-Newton iterative method. The trained model is then used to extract common fields from the headers of a research paper dataset. Experimental results show that the CRF model achieves an extraction precision above 90%, considerably higher than that of an HMM model.
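For orientation, the parameter estimation step can be written out using the standard linear-chain CRF formulation. This is a sketch of the textbook model, not necessarily the exact form used in the paper: the feature functions $f_k$, the weights $\lambda_k$, and the omission of a regularization term are assumptions made here for illustration.

```latex
% Conditional probability of a label sequence y given an observation sequence x
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

% Log-likelihood over N labeled training sequences (the objective to maximize)
\mathcal{L}(\lambda) = \sum_{i=1}^{N} \log p_\lambda\big(y^{(i)} \mid x^{(i)}\big)

% Gradient: empirical feature counts minus expected feature counts under the model
\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \sum_{i=1}^{N} \Big[ \sum_{t} f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}, t\big)
  - \mathbb{E}_{y \sim p_\lambda(\cdot \mid x^{(i)})} \sum_{t} f_k\big(y_{t-1}, y_t, x^{(i)}, t\big) \Big]
```

L-BFGS needs only the objective value and this gradient at each iterate; it approximates curvature from a short history of recent gradient and parameter differences rather than storing a full Hessian, which keeps memory use manageable when the feature set is large.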
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journal (北大核心)
2008, Issue 23, pp. 6094-6097 (4 pages)
Keywords
conditional random fields
text information extraction
parameter estimation
L-BFGS iterative method
feature set