Abstract
[Purpose/Significance] This paper studies how ensemble learning and transductive learning affect the performance of named entity recognition on electronic medical records, in order to provide a performance optimization method for machine-learning-based text information extraction. [Method/Process] First, the electronic medical record text provided by CCKS-2018 is analyzed, and Chinese word segmentation, part-of-speech tagging, and clinical entity category features are extracted. Then, using the conditional random field (CRF) algorithm, different combinations of input features are used to construct "base learners," which are integrated by voting. Finally, the ensemble result is optimized with a transductive learning method. [Results/Conclusion] In the experiments, ensemble learning achieved an overall F1 score of 86.93%, better than every base learner, and transductive learning achieved the model's best generalization performance, 87.06%. In addition, multi-feature combinations yielded better base learners than character features alone. The experiments show that ensemble learning over different input feature combinations and transductive learning can effectively improve the generalization performance of the model, and the method can be extended to other related areas of machine learning and text information extraction.
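The voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so this assumes a simple per-token plurality vote over the tag sequences produced by several base learners, with ties resolved in favor of the earliest learner. The function name `vote_tags` and the tag labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def vote_tags(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token tag sequences from several base learners by plurality vote."""
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all base learners must tag the same number of tokens")
    voted = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Tie-break (assumption): among the top-voted tags, keep the one
        # proposed by the earliest base learner in the input order.
        voted.append(next(t for t in token_tags if counts[t] == best))
    return voted


# Hypothetical outputs of three CRF base learners trained on different
# feature combinations, for the same four-token sentence.
base_outputs = [
    ["B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG"],
    ["B-SYMPTOM", "O", "O", "B-DRUG"],
    ["B-SYMPTOM", "I-SYMPTOM", "O", "O"],
]
ensemble = vote_tags(base_outputs)
print(ensemble)
```

Voting token by token is only one possible design. It can produce tag sequences that no single learner emitted, so a real system may need a consistency check on the resulting B-/I- boundaries.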
Authors
Sun An (孙安)
Yu Yingxiang (于英香)
Luo Yonggang (罗永刚)
Sun Xun (孙逊)
Sun An; Yu Yingxiang; Luo Yonggang; Sun Xun (Information and Archival Department, Shanghai University, Shanghai 200444; Library, Henan University of Science and Technology, Luoyang 471023; College of Medical Instrument, Shanghai University of Medicine & Health Sciences, Shanghai 201318; Qian Xuesen Library, Shanghai Jiao Tong University, Shanghai 200030)
Source
Journal of Intelligence (《情报杂志》)
CSSCI
PKU Core Journal (北大核心)
2019, No. 10, pp. 176-183, 199 (9 pages)
Funding
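One of the research outputs of the National Social Science Fund of China project "Research on Theoretical Reconstruction, Technology Selection, and Practical Innovation of Archival Data Management in the Context of Big Data" (No. 18BTQ092)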
Keywords
named entity recognition (命名实体识别)
feature extraction (特征提取)
ensemble learning (集成学习)
transductive learning (直推学习)
electronic medical record (电子病历)