Abstract
When conditional random fields (CRFs) are used for information extraction, methods that segment text purely into words or purely into blocks cannot fully exploit context information to segment and extract at the appropriate granularity. This paper therefore proposes a CRF-based hierarchical method for extracting information from research papers. The method uses format information such as separators, line breaks, and line-initial characters, combined with the feature functions of the CRF, to segment the text at the appropriate level: text lines, blocks, or individual words. The model parameters are then learned with the L-BFGS algorithm, and the specific text fields are extracted. Experimental results show that the proposed method outperforms CRF-based extraction methods that simply segment text into words or into blocks.
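The segmentation step described in the abstract can be illustrated with a minimal sketch. The abstract names the kinds of format cues (separators, line breaks, line-initial characters) but not the actual rules, so the regular expressions, the `segment` and `features` functions, and the order in which the cues are tried are all assumptions made for illustration, not the authors' implementation. The sketch only produces units and per-unit feature dictionaries of the kind a CRF toolkit could consume; training with L-BFGS is not shown.

```python
import re

# Hypothetical format cues: the abstract lists the cue types, not the exact rules.
SEPARATORS = re.compile(r"[,;]\s*")                      # list separators
LINE_HEAD = re.compile(r"^\s*(\[\d+\]|\d+\.|[-*•])\s+")  # line-initial markers


def segment(text):
    """Split text into (granularity, unit) pairs: whole lines, blocks, or words."""
    units = []
    for line in text.splitlines():       # line breaks give the coarsest split
        line = line.strip()
        if not line:
            continue
        if LINE_HEAD.match(line):
            # A line-initial marker (e.g. "[1]") suggests keeping the line whole.
            units.append(("line", line))
        elif SEPARATORS.search(line):
            # Separators suggest the line is a list of blocks.
            units.extend(("block", b) for b in SEPARATORS.split(line) if b)
        else:
            # No format cue: fall back to individual words.
            units.extend(("word", w) for w in line.split())
    return units


def features(units, i):
    """Build an illustrative feature dictionary for the unit at position i."""
    kind, unit = units[i]
    return {
        "granularity": kind,
        "first_is_upper": unit[:1].isupper(),
        "has_digit": any(c.isdigit() for c in unit),
        "prev_granularity": units[i - 1][0] if i > 0 else "BOS",
    }


sample = (
    "[1] J. Lafferty. Conditional random fields\n"
    "Alice Smith, Bob Jones\n"
    "hierarchical extraction"
)
units = segment(sample)
for i, (kind, unit) in enumerate(units):
    print(kind, repr(unit), features(units, i))
```

In this sketch the granularity of each unit is itself a feature, which is one simple way to let a sequence model see the level at which a piece of text was segmented.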
Source
《计算机应用研究》
CSCD
Peking University Core Journals (北大核心)
2009, No. 10, pp. 3690-3693 (4 pages)
Application Research of Computers
Funding
Natural Science Foundation Project of the Chongqing Science and Technology Commission (2007BB2372)
China Postdoctoral Science Foundation (20070420711)
Keywords
information extraction
conditional random fields (CRFs)
hierarchy